Abstract


Large real-world graph (a.k.a. network, relational) data are omnipresent in online
media, businesses, science, and government. Analysis of these massive graphs is
crucial for extracting descriptive and predictive knowledge with many commercial,
medical, and environmental applications. Beyond the general structure of the data,
knowing what stands out in it, i.e. what is anomalous or novel, is often at least as
important and interesting, if not more so.

In this thesis, we build novel algorithms and tools for mining and modeling
large-scale graphs, with a focus on: (1) Graph pattern mining: we discover surprising
patterns that hold across diverse real-world graphs, such as the “fortification
effect” (e.g. the more donors a candidate has, the super-linearly more money s/he
will raise), dynamics of connected components over time, and power-laws in human
communications; (2) Graph modeling: we build generative mathematical models,
such as the RTG model based on “random typing”, which successfully mimics a long
list of properties that real graphs exhibit; (3) Graph anomaly detection: we develop a
suite of algorithms to spot abnormalities under various conditions: for (a) plain weighted
graphs, (b) binary and categorical attributed graphs, (c) time-evolving graphs, and
(d) sensemaking and visualization of anomalies.
Chapter 1
Introduction 


1.1 Motivation and Applications 


In this thesis, our focus is mining patterns and anomalies in large, time-varying, real-world 
graphs using scalable algorithms and tools. Graphs provide a powerful machinery for modeling 
many types of relations for both natural and human-made structures in physical, biological, and 
social systems. As a result, graph data has become ubiquitous: the Internet, social networks, food
webs, protein networks, and many more. These real-world graphs have posed a wealth of fascinating
research questions and high-impact applications. For example, online social networks have
been generating billions of dollars in revenue; anomaly detection in terrorism networks as well
as computer networks is vital for national security; blogs have become an important medium
for spreading information and ideas on the World Wide Web; brain networks and gene regulatory
networks can help us understand how our brains and cells work.

Graphs also serve as powerful tools for solving real-world problems. In other words, many 
problems of practical interest can be represented as natural problems on graphs. Historically, the
well-known “Seven Bridges of Koenigsberg” problem was the first real-world problem to be
solved by the study of graphs. Introduced by Leonhard Euler in 1735, the problem asks for
a walk through the city that would cross each bridge once and only once. Representing land 
as nodes and bridges as edges, Euler reformulated and analyzed the problem in abstract terms, 
laying the foundations of graph theory. 

Since Euler, researchers have cast a tremendous number of real-world problems in diverse domains
as graph problems. For example, graphs are widely used in sociology as a way to explore
the spread of information or diseases among individuals. In biology, graphs are used to represent
regions of species and their migration paths to study breeding patterns. Graph-centric methods
are also useful for the study of molecules in chemistry, with applications like chemical similarity
and largest common substructure search. Other applications include community detection,
peer recommendation, ad targeting, and hijack and phishing detection. In summary, graphs have
proven to be a convenient abstraction to reason about numerous problems, and graph algorithms
are at the heart of various problems in many diverse fields.

As graphs are among the most ubiquitous models for real-world data, they rapidly grow in size
as more and more data become available. In 1735, the graph that Euler studied had only four
nodes and seven edges. Today, social graphs like Facebook and Twitter, after astounding growth in
the past decade, have reached hundreds of millions of users around the world. The World Wide
Web contains at least nine billion pages. Through impressive technical advances, biologists are
gathering more and more genomic data at lower costs. These huge amounts of real-world data
present even higher-impact opportunities. With this new phenomenon, one of the biggest challenges
lies in interpreting the huge volume of data being generated and in extracting descriptive,
predictive, and commercially or medically useful information from it.

Under graph mining, our interests group into three interrelated topics: 

1. Graph patterns: We focus on identifying regularities that hold across real-world graphs,
and understanding their formation and evolution. Such patterns prove to be useful for various
data mining tasks such as (1) modeling, by providing intuition into the mechanisms
by which networks form and evolve; (2) summarization, by providing a compact representation;
(3) forecasting, by representing continuing trends; and (4) anomaly detection, by
revealing data instances that deviate significantly from the observed trends.

2. Graph generators: We focus on developing generative models that produce synthetic
graphs that can mimic real graphs. Generators are useful in providing a laboratory environment
for generating synthetic graphs with certain size and characteristics, especially
for simulation studies.

3. Anomaly detection: Our goal is to build models that capture the norms in graph data and
later to exploit these models to detect and characterize anomalies and events that significantly
deviate from normal behavior. This is a crucial task, with numerous applications in
finance, security, health care, law enforcement, and so on.

Next, we give the thesis statement and an overview of how the thesis is organized. We also list 
the problems we address and provide a summary of our contributions. 


1.2 Thesis Statement and Overview 


Real-world networks exhibit regularities, which then help reveal anomalies. More specifically,
in this thesis we investigate the regularities (patterns) in the formation, structure, and
evolution of real-world networks, and develop generative models (generators) that explain and capture
these regularities and that mimic real-world networks. We then build new methods to spot
irregularities (anomalies and events) in data, exploiting these regularities in addition to using
several novel graph mining and compression techniques.

Following the thesis statement, we organize the thesis into three parts. The outline is shown in 
Table 1.1. 

Part I: Graph Patterns (Observations)

• Chapter 3 Patterns in graph topology [Akoglu et al., 2008; Kang et al., 2010a; Mcglohon et al., 2008].

• Chapter 4 Patterns in human communications [Akoglu et al., 2012a; Du et al., 2009; Vaz de Melo et al., 2010].

Part II: Graph Generators (Models)

• Chapter 6 Models of graph topology [Akoglu and Faloutsos, 2009; Akoglu et al., 2008].

• Chapter 7 Models of human communications [Du et al., 2009].

Part III: Anomaly Detection (Algorithms)

• Chapter 9 Anomalies in plain graphs [Akoglu et al., 2010].

• Chapters 10, 11 Anomalies in attributed graphs [Akoglu et al., 2012b,c].

• Chapter 12 Events in time-evolving graphs [Akoglu and Faloutsos, 2010].

• Chapter 13 Sensemaking for graph anomalies [Akoglu et al., 2012d].


Table 1.1: Outline of thesis with references to chapters. 


In the first part, Patterns of Networks, we focus on the structure and dynamics of real-world
graphs. We are interested in understanding how new nodes and edges arrive in the network and
the structure they generate at large, how they form components and how these components grow
and merge over time, and the regularities associated with weights on edges. Another focus of this 
part is the study of human communications, where we analyze the social circles humans belong 
to, how often they reciprocate, and their calling behavior. The questions we address are: 

What are the typical patterns in the structure and dynamics of real-world networks from a 
large collection of diverse domains (e.g. political campaign donations, computer network 
traffic, online social network, large Web graph)? (Chapter 3) 

What are the typical patterns in human communications? How large are our social circles? 
How reciprocal are we, and what does it depend on? How long are our phone calls, and 
how can we summarize them? (Chapter 4) 

Impact 

We are among the first to discover patterns in weighted graphs and in the connected components 
of graphs. 

• “Rebel probability” and oscillating/constant-size connected components: We discover 
that the smaller connected components grow up to a certain size at which they merge 
into the largest component (§3.1.3), and that newcomers rebel against this largest component with
a probability that drops exponentially with their degree (§3.1.5).

• “Fortification effect” and other non-linear weighted-network patterns: Our findings
include the surprising law of fortification (§3.2.2), which formulates a power law relation 
between edges and weights in real networks. 

• Power-laws in human communications: We contribute the first model (§4.2.3) that succinctly
captures the reciprocity behavior in human communications, and a new model (§4.3.2)
that summarizes call duration behavior of millions of humans. Both models consistently
provide the best fit to real data, among popular competitors (Pareto, Lognormal, Yule).

In the second part, Generative Models of Networks, we create models that produce synthetic 
but realistic graphs. First, we develop models that mimic general real-world graphs. Second, we 
focus particularly on human communication behavior and build a model to mimic this type of 
behavior. The main questions of interest are: 

How could we have a realistic generative model that will produce synthetic graphs that
look real, i.e. graphs that obey all the patterns we know so far, as well as the newly
discovered ones for dynamic and weighted graphs? (Chapter 6)

How could we design a realistic and intuitive generative model that will naturally reproduce
real human-to-human communication behavior? (Chapter 7)

Impact 

We are among the first to propose a graph generator for weighted graphs that also captures all 
other related properties. 

• Realistic generative models for graphs: Our first model for general graphs, the Recursive 
Tensor Model (§6.1), provably models certain graph laws and is the first to model bursty 
traffic. Our later model, the Random Typing Graphs (RTG) (§6.2), unlike previous models, 
mimics all eleven of the known properties of real-world graphs (see Table 2.2 for a list). 
As such, RTG won the Best Knowledge Discovery Paper award in the Conference on 
Principles and Practice of Knowledge Discovery (PKDD) 2009. 

• Utility-driven model for human communications: We designed a realistic utility-driven
graph generator, the Pay-and-Call model (§7.1), for modeling human communication behavior.
This agent-based model allows us to understand the local mechanisms that play a
role in the formation of communication networks and to answer what-if scenarios.

In the third and last part, Anomaly Detection, we focus on the anomaly and event detection
problem. We exploit compression-based techniques, model fitting, and graph mining techniques
to address the following problems: anomalous node detection in plain graphs, cluster and outlier
detection in attributed graphs, and event detection in time series of graphs. We also develop
visualization and sensemaking tools for the end user or analyst of the detected anomalies. The
problems we address under the anomaly detection setting are:

Given a large, weighted graph, how can we find anomalies? Which rules should be violated
before we label a node as an anomaly? (Chapter 9)

Given a graph with node attributes, how can we find meaningful patterns such as clusters
of nodes and clusters of attributes, as well as bridges and outliers? (Chapter 10)

Given a database with categorical attributes, how can we find meaningful patterns that 
succinctly describe the data and spot outliers? (Chapter 11) 

At what points in time do many of the nodes in a given time-varying graph change their
behavior significantly? Can we attribute the change to specific nodes, that is, can we 
characterize which nodes change in behavior the most? (Chapter 12) 

How can we use the network structure to summarize a set of (anomalous) nodes, by partitioning
them such that for each part we have a simple tour connecting the grouped nodes,
while nodes in different parts are not easily reachable? In other words, how can we summarize
the set of (anomalous) nodes for easy sensemaking? (Chapter 13)

Impact 

We develop a collection of novel methods to detect anomalies in a graph under various settings: 
for graphs for which we only know the graph structure, for richer graphs with attributes, and for 
graphs changing over time. These methods serve as an ensemble that can be employed under 
different or changing conditions. 

• Automatic anomaly detection in plain graphs: We are among the first to detect outlier
nodes in graph data (§9.1), rather than in a collection of data points. As such, our OddBall
won the Best Paper award in the Pacific-Asia Conference on Knowledge Discovery
and Data Mining (PAKDD) 2010. It has been integrated into Carnegie Mellon’s Pegasus
Tera-Scale Graph Mining system, and is a main component of the ADAMS (Anomaly Detection
at Multi Scale) research initiative supported by DARPA.

• Parameter-free mining and clustering in attributed graphs: We introduced a novel 
clustering model, called PICS (§10.1), to find groups of nodes in attributed graphs. PICS spots 
bridge-nodes with connections across clusters and outlier-nodes that do not belong well to 
any cluster. We also proposed a new approach, called CompreX (§11.1), to identify 
anomalies in attributed graphs, that exploits pattern-based compression. Compression has 
been primarily used in communications theory and databases; our work is one of few to 
explore compression for data mining tasks. This work and its extension to time-evolving 
graphs led to two patents (filed with high-impact score) at IBM Research Labs. 

• Automatic event detection and “attribution” in time-evolving graphs: We developed 
an algorithm (§12.1) to quantify “change” for time-varying graphs that flags “events” at 
which the change is significant. It also performs “attribution”, pointing out the specific nodes
that contribute most to the change. This work was developed for the Network Science 
Collaborative Technology Alliance (NS-CTA) supported by the Army Research Labs. 

We note that all of our methods (1) work for unlabeled and weighted graphs, and (2) operate
in a completely unsupervised fashion. Moreover, we enable “model-driven anomaly characterization”,
that is, we provide the reasoning for flagged anomalies, as well as novel tools for
visualization and sensemaking (§13.1).
Part I


Patterns of Networks 



Chapter 2 
Preliminaries 

2.1 Introduction 


What do real-world networks look like on a global scale? What are the typical patterns in the
structure and dynamics of networks at large? For example, how do nodes join a network and 
form edges over time? After an edge forms, what are the patterns associated with the weights on 
the edges? How do the different connected components of a network form and evolve? What are 
the typical patterns in human communications? 

In this part, we begin to answer these questions by studying many networks from diverse domains.
The idea is to analyze topological and dynamical features of these networks, such as the
diameter, connected components, and edge weights, and how they correlate with each other and
evolve over time.

There has been extensive work focusing on static snapshots of unweighted graphs, where fascinating
properties have been discovered, the most striking ones being the “small-world” phenomenon
[Watts and Strogatz, 1998] (also known as “six degrees of separation” [Milgram,
1967]) and the power law degree distributions [Barabasi and Albert, 1999; Faloutsos et al., 1999].
Time-evolving graphs have attracted attention only recently, where even more fascinating properties
have been discovered, like shrinking diameters and the so-called densification power law
[Leskovec et al., 2005b].

One central topic of this part is the study of some of the most important properties apparent 
in networks, with a particular emphasis on dynamic networks, as well as some of the newer 
findings with respect to weighted networks. Most analyses in previous work focused on the
“giant connected component” (GCC), either explicitly or implicitly, and moreover ignored
multiple links between nodes or weights on edges. Here we shift our focus to the components
that are of moderate size but “disconnected” from the GCC of the graph, which we will refer to 
as the “next-largest connected components” (NLCCs). In particular, we study the dynamics of 
how these components form and merge. We also study edge weights, particularly how weighted 
edges are added over time and their correlation with other structural properties. 

Another main theme of this part is the study of human communication networks. Most previous
work in network science and social network analysis focuses on node-level degree distributions
[Broder et al., 2000; Faloutsos et al., 1999; Leskovec et al., 2007b], communities [Nussbaum
et al., 2010; Satuluri and Parthasarathy, 2009; Tantipathananandh et al., 2007], and triadic
relations, such as clustering coefficients and triangle closures [Granovetter, 1973]. Here we build
on triadic relations and study the social circles people belong to in human-to-human networks, like
phone call, text, and instant messaging networks. We also study dyadic relations, specifically the
reciprocity relations among individuals and the associated bivariate distributions. Furthermore,
we find new patterns with respect to call durations of mobile phone users.

The questions of interest are: 

• How do networks behave over time? Does the structure vary as the network grows? In what
fashion do new entities enter a network? Does the network retain certain graph properties 
as it grows and evolves? Does the graph undergo a “phase transition”, in which its behavior 
suddenly changes? 

• How do the next-largest connected components change over time? One might argue that 
they grow, as new nodes are being added; and their size would probably remain a fixed 
fraction of the size of the GCC. Someone else might counter-argue that they shrink, and 
they eventually get absorbed into the GCC. What is happening, in real graphs? 

• What distributions and patterns do weighted graphs maintain? How does the distribution
of weights change over time: do we also observe a densification of weights as well as of
single edges? How does the distribution of weights relate to the degree distribution? Is the
edge weight related to the popularity of its adjacent nodes? Is the addition of weight bursty
over time, or is it uniform?

• What patterns should we expect in a network of human-to-human interactions? How large
are our social circles on average? If someone has many contacts, does that indicate popularity?
How long are the phone calls of mobile phone users? What are the chances that a call
ends, given its current duration? What is the best way to summarize the calling behavior
of a user? What are typical reciprocity relations between individuals? What is reciprocity
correlated with in the network?

Answering these questions is important to understand how natural graphs form and evolve, and 
to (a) spot anomalous graphs and sub-graphs; (b) answer questions about entities in a network 
and what-if scenarios; and (c) develop understanding on human communication behavior. 

Let’s elaborate on each of the above applications: (a) Spotting anomalies is vital for determining
abuse or fault in social and computer networks, such as unwanted calls and faulty equipment
in communication networks, link-spamming in a web graph, fraudulent reputation building in
e-auction systems [Pandit et al., 2007], detection of dwindling/abnormal social sub-groups in
social-networking sites like Yahoo-360 (360.yahoo.com), Facebook (www.facebook.com),
and LinkedIn (www.linkedin.com), and network intrusion [Lazarevic et al., 2003]. (b) Analyzing
network properties is also useful for identifying authorities and for search algorithms [Borodin
et al., 2005; Chakrabarti et al., 1999; Kumar et al., 2006], for discovering the “network value”
of customers for use in viral marketing [Richardson and Domingos, 2002], or for improving
recommendation systems [Bell et al., 2007]. What-if scenarios are vital for extrapolation, provisioning,
and algorithm design: for example, if we expect that the number of links will double within
the next year, we should provision the appropriate hardware and infrastructure to store and
process the upcoming queries. (c) Human behavior models provide insight into natural human
behavior at the individual level (e.g. evidence shows that reciprocal relationships are highly likely
to persist in the future [Cesar A. Hidalgo, 2008]) as well as into global behavior at network
scale (the propagation of viruses in computer networks or the spread of information and ideas in social
networks clearly depends on the presence of mutual links in the network).

The following two chapters in this part are based on work as cited below. 

• Chapter 3 Patterns in graph topology 

[Akoglu et al., 2008; Kang et al., 2010a; Mcglohon et al., 2008] 

• Chapter 4 Patterns in human communications 

[Akoglu et al., 2012a; Du et al., 2009; Vaz de Melo et al., 2010]. 

Before delving into the study of real graph patterns, we next establish the terms and definitions 
we use in the rest of this part, present related work, and introduce the datasets we studied. 


2.2 Definitions 

In this section we provide the basic definitions and terms we use. A full list of symbols is given in
Table 2.1.


Networks as Graphs 

A network is typically represented by a graph. Throughout the thesis we use network and graph 
interchangeably. 

A static, unweighted graph $\mathcal{G}$ consists of a set of nodes $\mathcal{V}$ and a set of edges $\mathcal{E}$: $\mathcal{G} = (\mathcal{V}, \mathcal{E})$.
We represent the sizes of $\mathcal{V}$ and $\mathcal{E}$ as $|\mathcal{V}|$ and $|\mathcal{E}|$. A graph may be directed or undirected:
for instance, a phone call may be from one party to another, and will have a directed edge, or a
mutual friendship may be represented as an undirected edge.

Graphs may also be weighted, where there may be multiple edges occurring between two nodes
(e.g. repeated phone calls) or specific edge weights (e.g. monetary amounts for transactions). In
a weighted graph $\mathcal{G}$, let $e_{i,j}$ be the edge between node $i$ and node $j$. We shall refer to these two
nodes as the “neighboring nodes” or “incident nodes” of edge $e_{i,j}$. Let $w_{i,j}$ be the weight on edge
$e_{i,j}$. The total weight $w_i$ of node $i$ is defined as the sum of weights of all its incident edges, that
is $w_i = \sum_{k=1}^{d_i} w_{i,k}$, where $d_i$ denotes its degree. As we show later, there is a relation between a
given edge weight $w_{i,j}$ and the weights of its neighboring nodes $w_i$ and $w_j$.
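
As a small illustration of the node-weight definition above, the following Python sketch sums the weights of the incident edges of each node and counts its degree. The edge list and the helper name `node_weights` are hypothetical and chosen only for this example; edges are treated as undirected and self-loops are assumed to be absent.

```python
from collections import defaultdict

def node_weights(weighted_edges):
    """Return (w, d): total weight w_i and degree d_i of every node.

    `weighted_edges` is an iterable of (i, j, w_ij) triples describing an
    undirected weighted graph without self-loops (an assumption of this sketch).
    """
    w = defaultdict(float)
    d = defaultdict(int)
    for i, j, w_ij in weighted_edges:
        for node in (i, j):       # the edge is incident to both endpoints
            w[node] += w_ij       # w_i = sum of weights of incident edges
            d[node] += 1          # d_i = number of incident edges
    return dict(w), dict(d)

# Hypothetical toy graph: three nodes, three weighted edges.
edges = [("a", "b", 3.0), ("a", "c", 1.5), ("b", "c", 2.0)]
w, d = node_weights(edges)
```

Keeping the degree next to the total weight makes it easy to later compare the two quantities per node.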

Finally, graphs may be unipartite or multipartite. Most social networks one thinks of are unipartite:
people in a group, papers in a citation network, etc. However, there may also be multipartite
graphs, that is, graphs with multiple classes of nodes in which edges are only drawn between nodes
of different classes. Bipartite graphs, like the IMDB movie-actor graph, consist of disjoint sets
of nodes $\mathcal{V}_1$ and $\mathcal{V}_2$, say, for actors and movies, with no edges among nodes of the same type.

Symbol               Description
$\mathcal{G}$        Graph representation of datasets
$\mathcal{V}$        Set of nodes for graph $\mathcal{G}$
$\mathcal{E}$        Set of edges for graph $\mathcal{G}$
$N$                  Number of nodes, or $|\mathcal{V}|$
$E$                  Number of edges, or $|\mathcal{E}|$
$e_{i,j}$            Edge between node $i$ and node $j$
$w_{i,j}$            Weight on edge $e_{i,j}$
$w_i$                Weight of node $i$ (sum of weights of incident edges)
$A$                  0-1 adjacency matrix of the unweighted graph
$A_w$                Real-value adjacency matrix of the weighted graph
$a_{i,j}$            Entry in matrix $A$
$\lambda_1$          Principal eigenvalue of unweighted graph
$\lambda_{1,w}$      Principal eigenvalue of weighted graph

Table 2.1: Table of symbols used in notation.

We can also represent a network with an adjacency matrix A, where nodes are in rows and 
columns, and numbers in the matrix indicate the existence of edges. For unweighted graphs, all 
entries are 0 or 1; for weighted graphs the adjacency matrix contains the values of the weights. 
Figure 2.1 shows examples of graphs and their adjacency matrices. 
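
The two kinds of adjacency matrix can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `adjacency_matrices`, the toy edge list, and the convention that rows are sources and columns are destinations are all choices made for this example, and repeated edges are assumed to add up in the weighted matrix.

```python
def adjacency_matrices(nodes, weighted_edges, directed=True):
    """Build the 0-1 matrix A and the weighted matrix A_w as nested lists."""
    index = {v: k for k, v in enumerate(nodes)}   # node label -> row/column
    n = len(nodes)
    A = [[0] * n for _ in range(n)]
    A_w = [[0.0] * n for _ in range(n)]
    for i, j, w_ij in weighted_edges:
        r, c = index[i], index[j]
        A[r][c] = 1            # an edge exists
        A_w[r][c] += w_ij      # repeated edges accumulate weight
        if not directed:       # mirror the entry for undirected graphs
            A[c][r] = 1
            A_w[c][r] += w_ij
    return A, A_w

# Hypothetical directed toy graph with a repeated edge a -> b.
nodes = ["a", "b", "c"]
edges = [("a", "b", 2.0), ("a", "b", 1.0), ("c", "a", 5.0)]
A, A_w = adjacency_matrices(nodes, edges)
```

Dense nested lists are used here only for readability; for graphs of the sizes discussed in this thesis a sparse representation would be the practical choice.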

We next introduce other important concepts we use in analyzing these graphs. 


Components and Cliques 

We refer to a connected component in a graph as a set of nodes and edges where there exists a
path between any two nodes in the set. (For directed graphs, this translates to a weakly connected
component, whereas a strongly connected component requires a directed path between its pairs
of nodes.) We find that in real graphs over time, a giant connected component (GCC) forms.
However, it is also of interest to study the smaller components: when do they choose to join the
GCC, what size do they reach before doing so, and what does the distribution of component sizes
look like? We provide answers to these questions in Chapter 3.
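
To make the notions of the GCC and the smaller components concrete, here is a minimal Python sketch that finds the sizes of the (weakly) connected components of an edge list with breadth-first search. The function name `component_sizes` and the toy edge list are hypothetical; nodes without any edge do not appear in an edge list and are therefore ignored in this sketch.

```python
from collections import defaultdict, deque

def component_sizes(edges):
    """Sizes of the weakly connected components, largest first."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)        # ignore direction: weak connectivity
    seen = set()
    sizes = []
    for start in neighbors:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:               # breadth-first search from `start`
            u = queue.popleft()
            size += 1
            for v in neighbors[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        sizes.append(size)
    return sorted(sizes, reverse=True)

# Hypothetical toy graph with three components.
edges = [(1, 2), (2, 3), (3, 4), (5, 6), (6, 7), (8, 9)]
sizes = component_sizes(edges)
gcc_size, next_largest = sizes[0], sizes[1:]
```

Running this on successive snapshots of a time-evolving graph would give the kind of component-size series whose dynamics are studied in Chapter 3.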

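Given a subgraph $\mathcal{G}_i$ of $\mathcal{G}$, if $\forall u, v \in \mathcal{V}_{\mathcal{G}_i}$, $\exists (u, v) \in \mathcal{E}_{\mathcal{G}}$, then $\mathcal{G}_i$ is called a complete subgraph
or a clique of $\mathcal{G}$. Furthermore, if there is no other subgraph $\mathcal{G}_j$ that is also a clique of $\mathcal{G}$ with
$\mathcal{V}_{\mathcal{G}_j} \supset \mathcal{V}_{\mathcal{G}_i}$, $\mathcal{G}_i$ is called a maximal clique of $\mathcal{G}$. In Chapter 4, we treat the maximal cliques in
a graph as social circles, and study their properties in human communication networks.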
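
The definition of a maximal clique can be illustrated with the classic Bron-Kerbosch recursion (shown here without pivoting). This is a generic textbook procedure used for illustration, not necessarily the method applied in Chapter 4; the function name `maximal_cliques` and the toy graph are hypothetical.

```python
def maximal_cliques(neighbors):
    """Enumerate all maximal cliques of an undirected graph.

    `neighbors` maps every node to the set of its adjacent nodes.
    """
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates that extend r, x: already handled
        if not p and not x:
            cliques.append(frozenset(r))   # r cannot be extended: maximal
            return
        for v in list(p):
            expand(r | {v}, p & neighbors[v], x & neighbors[v])
            p = p - {v}                    # v is done as a candidate
            x = x | {v}

    expand(set(), set(neighbors), set())
    return cliques

# Hypothetical toy graph: a triangle a-b-c plus a pendant edge c-d.
neighbors = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
circles = maximal_cliques(neighbors)
```

In this toy graph the two "social circles" are the triangle and the single edge hanging off it; neither is contained in a larger clique.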

Figure 2.1: Illustrations of example graphs. On the left is a unipartite, directed, weighted graph
and the corresponding adjacency matrix. On the right is an undirected, bipartite 
graph and the corresponding adjacency matrix. 


Diameter and Effective Diameter 

For a given (static) graph, its diameter is defined as the maximum distance between any two 
nodes, where distance is the minimum number of hops (i.e., edges that must be traversed) on the 
path from one node to another, ignoring directionality and edge-weights. 

Since the diameter is defined as the maximum-length shortest path between all possible pairs, it
can easily be hijacked by long chains. Therefore, the effective diameter is often used as a more
robust metric, which is the 90th percentile of the pairwise distances among all reachable pairs of
nodes. In other words, the effective diameter is the minimum number of hops in which some
fraction (usually 90%) of all connected node pairs can be reached [Siganos et al., 2006].

Calculating the diameter of a graph is $O(N^2)$. Therefore, we choose to estimate the graph diameter
by sampling nodes from the giant component. For $s = \{1, 2, \ldots, S\}$, we choose two nodes at
random and calculate the distance (using breadth-first search). We then record the 90th
percentile value of distances, that is, we take the $0.9S$-th largest recorded value. The distance operation
is $O(dk)$, where $d$ is the graph diameter and $k$ the maximum degree of any node; on average
this is a much smaller cost. Intuitively, the diameter represents how much of a “small world” the
graph is: how quickly one can get from one “end” of the graph to another [Tauro et al., 2001].
Alternative methods to sampling would include ANF [Palmer et al., 2002].
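
The sampling procedure described above can be sketched as follows. This is a minimal illustration, not the exact implementation used in the thesis: the function names, the default of 1000 samples, the fixed random seed, and the choice to read the 90th percentile from the distances sorted in ascending order are all assumptions of this sketch, and the input is assumed to be the giant component given as an adjacency mapping.

```python
import random
from collections import deque

def bfs_distance(neighbors, source, target):
    """Hop distance from source to target, or None if unreachable."""
    if source == target:
        return 0
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in neighbors[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                if v == target:
                    return dist[v]
                queue.append(v)
    return None

def sampled_effective_diameter(neighbors, num_samples=1000, quantile=0.9, seed=0):
    """Estimate the effective diameter from randomly sampled node pairs."""
    rng = random.Random(seed)
    nodes = list(neighbors)
    distances = []
    for _ in range(num_samples):
        u, v = rng.sample(nodes, 2)          # two distinct random nodes
        d = bfs_distance(neighbors, u, v)
        if d is not None:                    # keep reachable pairs only
            distances.append(d)
    if not distances:
        raise ValueError("no reachable pair was sampled")
    distances.sort()
    rank = max(0, int(quantile * len(distances)) - 1)
    return distances[rank]                   # 90th percentile by default

# Hypothetical toy graph: a ring of 20 nodes (true diameter 10).
ring = {i: [(i - 1) % 20, (i + 1) % 20] for i in range(20)}
eff = sampled_effective_diameter(ring, num_samples=200)
```

The early exit in `bfs_distance` stops the search as soon as the target is found, which is why the per-pair cost is usually far below a full traversal.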


Radius and Effective Radius


While the diameter is defined over an entire graph, radius is defined for a particular node in the 
graph. Specifically, for a given node, its radius is defined as the maximum of all its distances to 
other nodes in the graph. Similar to effective diameter as described before, its effective radius is 
the 90-percentile of these distances. 


Heavy-tailed Distributions 

While the Gaussian distribution is common in nature, there are many cases where the probability 
of events far to the right of the mean is significantly higher than in Gaussians. In the Internet, 
for example, most routers have a very low degree (perhaps “home” routers), while a few routers 
have extremely high degree (perhaps the “core” routers of the Internet backbone) [Faloutsos
et al., 1999]. Heavy-tailed distributions attempt to model this. They are known as “heavy-tailed”
because, while traditional exponential distributions have bounded variance (large deviations from
the mean become nearly impossible), here $p(x)$ decays polynomially instead of exponentially
as $x \to \infty$, creating a “fat tail” for extreme values on the PDF plot.

One of the more well-known heavy-tailed distributions is the power law distribution. Two variables
$x$ and $y$ are related by a power law when:

$y(x) = A x^{-\gamma}$ (2.1)

where $A$ and $\gamma$ are positive constants; $\gamma$ is often called the power law exponent.

A random variable is distributed according to a power law when the probability density function
(pdf) is given by:

$p(x) = A x^{-\gamma}, \quad \gamma > 1, \quad x \geq x_{\min}$ (2.2)

The extra $\gamma > 1$ requirement ensures that $p(x)$ can be normalized. Power laws with $\gamma < 1$ rarely
occur in nature, if ever [Newman, 2005].
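
To see why the condition on the exponent is needed, one can work out the normalization explicitly. The short derivation below is a sketch that assumes a continuous variable and $x_{\min} > 0$; it uses only the symbols of Equation (2.2).

```latex
\int_{x_{\min}}^{\infty} A\, x^{-\gamma}\, dx
  = A \left[ \frac{x^{1-\gamma}}{1-\gamma} \right]_{x_{\min}}^{\infty}
  = A\, \frac{x_{\min}^{\,1-\gamma}}{\gamma - 1}
  \qquad (\gamma > 1),
\qquad\text{so}\qquad
A = (\gamma - 1)\, x_{\min}^{\,\gamma - 1}.
```

For $\gamma \leq 1$ the integral diverges at the upper limit, so no finite constant $A$ makes $p(x)$ integrate to one.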

Skewed distributions, such as power laws, occur very often in real-world graphs, as we will 
discuss. Figures 2.2(a) and 2.2(b) show two examples of power laws. 

While power laws appear in a large number of graphs, deviations from a pure power law are also
observed. The two most common deviations are exponential cutoffs and lognormals.

Sometimes, the distribution looks like a power law over the lower range of values along the x-axis,
but decays very fast for higher values. Often, this decay is exponential, and this is usually
called an exponential cutoff:

$y(x = k) \propto e^{-k/\kappa} k^{-\gamma}$ (2.3)

where $e^{-k/\kappa}$ is the exponential cutoff term and $k^{-\gamma}$ is the power law term.

Similar distributions were studied by [Bi et al., 2001], who found that a discrete truncated log¬ 
normal (called the Discrete Gaussian Exponential or “DGX” by the authors) gives a very good 
fit. A lognormal is a distribution whose logarithm is a Gaussian; it looks like a truncated parabola 
Figure 2.2: Power laws and deviations: Plots (a) and (b) show the in-degree and out-degree
distributions on a log-log scale for the Epinions graph (an online social network of
75,888 people and 508,960 edges [Richardson and Domingos, 2001]). Both follow
power-laws. In contrast, plot (c) shows the out-degree distribution of a Clickstream
graph (a bipartite graph of users and the websites they surf [Montgomery and Faloutsos,
2001]), which deviates from the power-law pattern.


in log-log scales. The DGX distribution has been used to fit the degree distribution of a bipartite 
“clickstream” graph linking websites and users (see Figure 2.2(c)). 

Methods for fitting heavy-tailed distributions are described in [Clauset et al., 2009; Newman,
2005].
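
As a minimal illustration of such fitting, the sketch below uses the standard maximum-likelihood estimate for a continuous power law, $\hat{\gamma} = 1 + n / \sum_i \ln(x_i / x_{\min})$, on synthetic data. The function names, the known `x_min`, and the inverse-transform sampler are assumptions made for the example; real degree data are discrete and usually also require estimating `x_min`.

```python
import math
import random


def sample_power_law(gamma, x_min, n, rng):
    """Draw n samples from p(x) ~ x^(-gamma), x >= x_min, by inverse transform."""
    # 1 - rng.random() lies in (0, 1], so the power is always finite.
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (gamma - 1.0)) for _ in range(n)]


def fit_exponent(samples, x_min):
    """Continuous maximum-likelihood estimate of the exponent gamma."""
    tail = [x for x in samples if x >= x_min]
    return 1.0 + len(tail) / sum(math.log(x / x_min) for x in tail)


rng = random.Random(0)                      # fixed seed for repeatability
data = sample_power_law(2.5, 1.0, 20000, rng)
print(round(fit_exponent(data, 1.0), 2))    # should be close to 2.5
```

Fitting a straight line to a log-log histogram is a common alternative, but it is generally more sensitive to binning choices than the likelihood-based estimate.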


Burstiness and Entropy Plots 


Human activity, including edge additions in graphs, is often bursty. If the traffic is self-similar,
then we can measure the burstiness using the intrinsic, or fractal, dimension of the cloud of
timestamps of edge-additions (or weight-additions). Let $\Delta W(t)$ be the total weight of edges that
were added during the t-th interval, e.g., the total network flow on day t, among all the machines
we are observing.

Among the many methods that measure self-similarity (Hurst exponent, etc. [Schroeder, 1991]),
we choose the entropy plot [Wang et al., 2002], which plots the entropy H(r) versus the
resolution r. The resolution is the scale; that is, at resolution r, we divide our time interval into $2^r$
equal sub-intervals, sum the weight-additions $\Delta W(t)$ in each sub-interval k ($k = 1 \ldots 2^r$),
normalize into fractions $p_k$ ($= \Delta W(t) / W_{total}$), and compute the Shannon entropy of the sequence
$p_k$: $H(r) = -\sum_k p_k \log_2 p_k$. If the plot H(r) is linear in some range of resolutions, the
corresponding time sequence is said to be fractal in that range, and the slope of the plot is defined
as the intrinsic (or fractal) dimension D of the time sequence. Notice that a uniform
weight-addition distribution yields D=1; a lower value of D corresponds to a more bursty time sequence
like a Cantor dust [Schroeder, 1991], with a single burst having the lowest D=0: the intrinsic
dimension of a point. Also note that the “b-model” [Wang et al., 2002], a variation of the 80-20
model, generates such self-similar traffic.
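
The entropy plot described above can be sketched in a few lines. The input layout (one weight per base interval, with a power-of-two number of intervals) and all function names are assumptions for illustration; the cascade generator is a simplified, deterministic stand-in for the b-model that always gives the fraction b to the left half.

```python
import math


def entropy_plot(weights, max_resolution):
    """Return [(r, H(r))] for r = 0..max_resolution.

    `weights` holds the weight added in each of 2**max_resolution
    equal-length base intervals.
    """
    n = len(weights)
    if n != 2 ** max_resolution:
        raise ValueError("expected 2**max_resolution base intervals")
    total = float(sum(weights))
    points = []
    for r in range(max_resolution + 1):
        width = n // (2 ** r)              # base intervals per sub-interval
        entropy = 0.0
        for k in range(2 ** r):
            p = sum(weights[k * width:(k + 1) * width]) / total
            if p > 0:                      # 0 * log(0) is taken as 0
                entropy -= p * math.log2(p)
        points.append((r, entropy))
    return points


def slope(points):
    """Least-squares slope of H(r) versus r, i.e. the estimate of D."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in points) / sum((x - mx) ** 2 for x in xs)


def cascade(b, levels):
    """Deterministic 'b / (1 - b)' multiplicative cascade over 2**levels intervals."""
    weights = [1.0]
    for _ in range(levels):
        weights = [part for w in weights for part in (b * w, (1 - b) * w)]
    return weights


print(slope(entropy_plot([1.0] * 64, 6)))       # uniform traffic: D = 1
print(slope(entropy_plot(cascade(0.8, 6), 6)))  # 80-20 cascade: D below 1
```

For the 80-20 cascade the slope equals the binary entropy of 0.8, roughly 0.72, which places it between the uniform case (D=1) and a single burst (D=0).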
2.3 Related Work


For the purpose of organization, we divide related work into those applying to the study of graph 
structure and dynamics and to the study of human communications. 


2.3.1 Previous patterns in structure and dynamics 

Properties of real-world graphs observed to date can be summarized under a “2-by-2” grid
structure: those on static graphs (structural patterns) and those on dynamic graphs (temporal patterns),
each of which can further be subdivided into two: unweighted and weighted patterns.

An overview of all known real-world graph properties, including the newer ones that we will 
describe in this thesis, is given in Table 2.2. The patterns highlighted in bold are introduced 
in this thesis in Chapter 3. As one can notice, our focus is mostly on weighted graphs and 
dynamic properties. In this section, we present previously discovered patterns; namely patterns 
in degree distributions, diameter, triangles, eigenvalues, community structure, and graph density. 
We describe those in detail next. 



           Unweighted                                Weighted

Static     SU-1 Heavy-tailed degree distribution     SW-1 Edge Weights Power Law (EWPL)
           SU-2 Small diameter                       SW-2 Snapshot Power Law (SPL)
           SU-3 Triangle Power Law (TPL)
           SU-4 Eigenvalue power law (EPL)
           SU-5 Community structure
           SU-6 Chain-like NLCCs

Dynamic    DU-1 Shrinking diameter                   DW-1 Weight Power Law (WPL)
           DU-2 Densification Power Law (DPL)        DW-2 Bursty weight additions
           DU-3 Diameter plot and Gelling point      DW-3 Weighted principal eig. over time
           DU-4 Constant/Oscillating NLCCs
           DU-5 Stable fractal dimension of CCs
           DU-6 ‘Rebel’ prob. to NLCCs
           DU-7 Principal eigenvalue over time


Table 2.2: Summary of real graph properties. Patterns in bold face are introduced in this thesis.
SU-1: Heavy-tailed Degree Distribution


The degree distribution of many real graphs obeys a power law of the form $f(d) \propto d^{-\alpha}$, with the
exponent $\alpha > 0$, and f(d) being the fraction of nodes with degree d. Such power-law relations
as well as many more have been reported in [Chakrabarti et al., 2004b; Faloutsos et al., 1999;
Kleinberg et al., 1999; Newman, 2005]. Intuitively, power-law-like distributions for degrees state
that there are many low degree nodes, whereas only a few high degree nodes (also referred to as
“hubs”) exist in real graphs.


SU-2: Small Diameter 

One of the most striking patterns that real-world graphs have is a small diameter, which is also 
known as the “small-world phenomenon” or the “six degrees of separation”. Intuitively, the 
diameter represents how much of a “small world” the graph is: how quickly one can get from
one “end” of the graph to another.

Many real graphs were found to exhibit surprisingly small diameters: for example, 19 for the
Web [Albert et al., 1999], and the well-known number 6 in social networks [Barabasi, 2003].
Most recently, [Kang et al., 2010b] showed that even a billion-node Yahoo! Web graph has
a diameter of only 7.6.


SU-3: Triangle Power Law (TPL) 

The number of triangles $\Delta$ and the number of nodes that participate in $\Delta$ triangles
follow a power law of the form $f(\Delta) \propto \Delta^{\sigma}$, with the exponent $\sigma < 0$ [Tsourakakis, 2008].
The TPL intuitively states that while many nodes have only a few triangles in their local
neighborhoods, a few nodes participate in many triangles with their neighbors. The local
number of triangles is also related to the clustering coefficient of graphs.
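
The quantity behind the TPL, the number of triangles each node participates in, can be computed as in the sketch below. It is a brute-force neighbor-pair check intended only to make the definition concrete; the function names and the edge-list input are assumptions, and this is not the counting method used on large graphs.

```python
from collections import Counter
from itertools import combinations


def triangles_per_node(edges):
    """Map each node to the number of triangles it participates in."""
    adj = {}
    for u, v in edges:
        if u == v:                      # ignore self-loops
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    counts = {}
    for node, nbrs in adj.items():
        # A triangle at `node` is a pair of its neighbors that are also linked.
        counts[node] = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return counts


def triangle_histogram(counts):
    """Number of nodes participating in each triangle count."""
    return Counter(counts.values())


# Two triangles, {1,2,3} and {3,4,5}, sharing node 3.
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (3, 5)]
counts = triangles_per_node(edges)
print(counts[3], dict(triangle_histogram(counts)))
```

Plotting the histogram on log-log axes for a real graph is what would reveal the power-law shape stated above.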


SU-4: Spectral properties and Eigenvalue Power Law (EPL) 

There have been several studies on spectral properties of power-law graphs. [Siganos et al., 2003]
examined the spectrum of the adjacency matrix of the AS Internet topology and reported that
the 20 or so largest eigenvalues of the Internet graph are power-law distributed with exponent
between 0.45 and 0.5. [Mihail and Papadimitriou, 2002] later provided an explanation for the
“Eigenvalue Power Law”, showing that it is a consequence of the “Degree Power Law”.

[Farkas et al., 2001] studied the numerical and analytical properties of the adjacency matrices
of complex networks and reported surprising results on the spectra of adjacency matrices
corresponding to several models of real-world graphs. Moreover, [Chung et al., 2003] analyzed the
eigenvalues of random graphs for which the number of nodes of degree d follows a power law and
reported bounds on the first and second eigenvalues of such graphs for certain parameters.
SU-5: Community Structure


Real-world graphs are found to exhibit a modular structure, with nodes forming groups, and
possibly groups within groups [Flake et al., 2002; Girvan and Newman, 2002; Schwartz and
Wood, 1992]. In a modular graph, the nodes form communities where nodes in the
same community are more tightly connected to each other than to nodes outside the community.
Quantitative measures for such a structure include modularity [Girvan and Newman, 2002] and
conductance [Andersen et al., 2006].


DU-1: Shrinking Diameter 

[Leskovec et al., 2005b] showed that not only is the diameter of real graphs small, but it also
shrinks and then stabilizes over time. This pattern can be attributed to the “gelling point” and the
“densification” in real graphs, both of which are described in the following sections. Briefly, at
the “gelling point” many small disconnected components merge and form the largest connected
component in the graph. This can be thought of as the ‘coalescence’ of the graph, at which point
the diameter ‘spikes’. Afterwards, with the addition of new edges the graph ‘densifies’ and the
diameter keeps shrinking until it reaches an equilibrium.


DU-2: Densification Power Law (DPL) 

Time-evolving graphs follow the “Densification Power Law” with the equation $E(t) \propto N(t)^{\beta}$
at all time ticks t [Leskovec et al., 2005b], where $\beta$ is the densification exponent, and E(t) and
N(t) are the number of edges and nodes at time t, respectively.

Real graphs studied are shown to obey the DPL, with exponents between 1.03 and 1.7. A
power-law exponent greater than 1 indicates a super-linear relationship between the number of nodes
and the number of edges in real graphs. That is, when the number of
nodes N in a graph doubles, the number of edges E more than doubles; hence the densification.
It also explains the shrinking diameter phenomenon described earlier.
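
A densification exponent can be estimated from a sequence of (N(t), E(t)) snapshots by a least-squares line in log-log space. The sketch below uses synthetic snapshot counts generated with a known exponent; the function name and the data are assumptions for illustration, not measurements from the datasets in this thesis.

```python
import math


def fit_power_law_exponent(nodes, edges):
    """Fit E = c * N**beta by least squares on (log N, log E); return (beta, c)."""
    xs = [math.log(n) for n in nodes]
    ys = [math.log(e) for e in edges]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    beta = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return beta, math.exp(my - beta * mx)


# Synthetic snapshots following E(t) = 3 * N(t)**1.5 exactly.
nodes = [100, 200, 400, 800, 1600]
edges = [3 * n ** 1.5 for n in nodes]
beta, c = fit_power_law_exponent(nodes, edges)
print(round(beta, 3), round(c, 3))
print(2 ** beta)   # factor by which E grows when N doubles (greater than 2 when beta > 1)
```

With an exponent of 1.5, doubling the number of nodes multiplies the number of edges by about 2.83, which is the "more than doubles" behavior described above.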


2.3.2 Previous studies on human communications 

There has been previous work on studying human-to-human communication networks in order to
understand various aspects of human behavior. [Onnela et al., 2007b] and [Onnela et al., 2007a]
built a network from mobile phone call records and analyzed its network properties in detail.
They identified relationships between node weights and network topology,
finding that the weak ties are commonly responsible for linking communities, thus having a high
betweenness centrality or low link overlap. Moreover, [Hidalgo and Rodriguez-Sickert, 2008]
verified that the persistence of an edge is highly correlated to its reciprocity and to the topological
overlap. It is also common to analyze the networks from mobile companies in order to improve
their services. For instance, [Cortes et al., 2001; Hill and Nagle, 2009] proposed a framework
and data structures for identifying fraudulent consumers on telecommunication networks based
on their degree distribution and dynamics, and [Nanavati et al., 2006] proposed metrics that can
be employed by a business strategy planner involved in the telecom domain.

Another use for a mobile phone dataset is to study the individual attributes of the users. [Guo 
et al., 2007] analyzed mobile phone calls that arrived in a mobile switch center in a GSM system 
of Qingdao, China, and they found that the duration of the phone calls is best modeled by a 
log-normal distribution. Later, [Seshadri et al., 2008] proposed the DPLN distribution to model 
the distributions of the number of phone calls per customer, the total talk minutes per customer 
and the distinct number of calling partners per customer. [Willkomm et al., 2008] studied the
duration of mobile calls arriving at a base station during different periods and found that
they are neither exponentially nor log-normally distributed, possessing significant deviations that
make them hard to model. They verified that about 10% of calls have a duration of around 27
seconds, which corresponds to calls that the called mobile users did not answer and that were
redirected to voice-mail.


2.4 Datasets 


We studied several large real networks, described in detail in Table 2.3, in terms of size, weights, 
and explanation of edges. Several of our graphs had no obvious weighting scheme: for example, 
a single paper or patent will cite another only a single time. The graphs that did have weights 
are also further divided into two schemes, multi-edges and edge-weights. In the edge-weights 
scheme, there is an obvious weight on edges, such as dollar amounts in campaign donations, or 
packet-counts in network traffic. For multi-edges, weights are added if there is more than one 
interaction between two nodes. For instance, if a blog cites another blog at a given time, its 
weight is 1. If it cites the blog again later, the weight becomes 2. In all the datasets, we assume 
that edges are never deleted, because edge deletion never explicitly appeared. 

The datasets are gathered from publicly available data. NIPS 1, Arxiv and Patent [Leskovec et al.,
2005b] are academic paper or patent citation graphs with no weighting scheme. IMDB indicates 
movie-actor information, where an edge occurs if an actor participates in a movie [Barabasi and 
Albert, 1999]. Netflix is the dataset from the Netflix Prize competition 2 , with user-movie links 
(we ignored the ratings); we also noticed that it only contained users with 100 or more ratings. 
BlogNet and PostNet are two representations of the same data, hyperlinks between blog posts 
[Leskovec et al., 2007b]. In PostNet nodes represent individual posts, while in BlogNet each 
node represents a blog. Essentially, PostNet is a paper citation network while BlogNet is an 
author citation network (which contains multi-edges). 

Oregon is an autonomous systems network 3 , which contains AS peering information inferred 
from Oregon route-views BGP data. Enron contains email interactions at Enron collected from 

1 www.cs.toronto.edu/~roweis/data.html

2 www.netflixprize.com

3 University of Oregon Route Views project, www.routeviews.org
Table 2.3: Graph data sets studied for pattern discovery. K: thousand, M: million, B: billion.







about 1998 to 2002 (made public by the Federal Energy Regulatory Commission during its
investigation). NetTraffic records IP-source/IP-destination pairs, along with the number of packets
sent, per unit time. Auth-Conf, Key-Conf, and Auth-Key are all from DBLP 4 , with the obvious
meanings. CampOrg and CampIndiv are bipartite graphs from the U.S. Federal Election
Commission. They record donations (dollar amounts) between organizations and political candidates,
and between individuals and organizations 5 .

HEP-PH and HEP-TH contain Physics phenomenology and theory paper citations, respectively. 
The YahooWeb graph contains 1.4 billion Web pages and 6.6 billion links among them. It was crawled
by the Yahoo-Altavista search engine in 2002 and contains around 31 million unique sites. The
top 3 sites with the largest number of pages are http://metaquest.bc.edu/, http://www.cricket.org/,
and http://www.nyu.edu/.

Datasets used to analyze patterns in human communications span several months of activity and 
cover different types of communications including phone call, SMS (Short Message Service), 
and IM (Instant Message). The edge weights denote the total count of phone calls, SMSs, and 
IMs exchanged as well as total durations of calls aggregated in minutes. In particular, CALL 
and SMS are the phone call and SMS interactions among the same anonymous individuals. The 
dataset was collected over a period of six months, December 1, 2007 through May 31, 2008 in 
a big city in Asia. Finally, Commun is another communications dataset among a different set of 
individuals from another big city in Asia. 


4 dblp.uni-trier.de/xml/ 

5 www.cs.cmu.edu/~mmcgloho/fec/data/fec_data.html
Chapter 3

Patterns in network topology 


PROBLEM STATEMENT: What are the typical patterns in the structure and dynamics of real- 
world networks from a large collection of diverse domains (e.g. political campaign donations, 
computer network traffic, online social network, large Web graph)? 


In this chapter, we introduce several new patterns that we found in the real networks we studied. An
overview of all the existing as well as new patterns has been given in Table 2.2; the ones in
boldface are our contributions. As one can notice, our major contributions are mainly dynamic
and weighted patterns. For the purposes of organization, we present them in two categories:
unweighted and weighted graph patterns.


3.1 Unweighted graph patterns 

3.1.1 Pattern SU-6: Chain-like Next-Largest Connected Components (NLCCs)


We study the relations between the average effective radius and the order of connected
components. Recall that the effective radius of a node in a component is the 90th percentile of all the
distances from that node to other nodes in the component. The average effective radius (AER)
of a connected component is then the average of the effective radii in the component.

Figure 3.1 (a) shows the number of nodes and the AER of connected components in YahooWeb.
We first see that there are many components which form cliques (AER close to 1), stars (AER
close to 2), and chains (AER proportional to the number of nodes). Since a component that
behaves like a clique seems suspicious, we looked at the 4 largest clique-like components. It
turns out that they all belong to Germany; they have the same number of nodes (305), and seem
to be phishing sites from the same owner, as their contents look similar. These sites existed
several years ago but no longer exist.
Figure 3.1: (a) Connected components map of the YahooWeb graph, showing the Average
Effective Radius (AER) vs. number of nodes in each component. Each point
corresponds to a connected component. Notice the effective radius is bounded by the
maximum (by chains) and the minimum (by cliques). (b) Maximum Effective
Radius (MER) vs. Average Effective Radius (AER). Notice that small components
behave like chains in terms of radius, and the giant connected component (GCC) is
very different from others due to the large MER compared to the AER.


Another observation is that the upper left boundary of Figure 3.1 (a) forms a near-line. Obviously, 
these nodes came from chains since they have the maximum average effective radius. We can 
prove that the boundary is a near-line since the average effective radius and the number of nodes 
in chain graphs have the following near-linear relationship: 

Lemma 1 (Small Radius of Chain Graph) The average effective radius of a chain graph with
n nodes is $0.6525n - 0.45$.

Proof 1 We need to consider only the first $n/2$ nodes, by symmetry. Since the effective
radius is the 90th percentile, the i-th nodes where $1 \leq i \leq 0.45n$ have the effective radius
$(0.9n - i)$, and the last $0.05n$ nodes have radius $0.45n$. Therefore, the average effective
radius is

$\frac{\sum_{i=1}^{0.45n} (0.9n - i) + 0.45n \times 0.05n}{0.5n} = 0.6525n - 0.45$
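
The lemma can be sanity-checked numerically by computing the effective radius of every node in a chain by brute force. The sketch below assumes one particular discrete definition of the 90th percentile (the smallest distance that covers at least 90% of the other nodes), so it agrees with the closed form only up to a small rounding offset; the function name is hypothetical.

```python
import math


def chain_average_effective_radius(n, quantile=0.9):
    """Average over all nodes of the `quantile`-percentile distance in a chain of n nodes."""
    rank = math.ceil(quantile * (n - 1)) - 1      # 0-based index into sorted distances
    total = 0
    for i in range(n):
        # In a chain, the distance between positions i and j is |i - j|.
        dists = sorted(abs(i - j) for j in range(n) if j != i)
        total += dists[rank]
    return total / n


n = 400
measured = chain_average_effective_radius(n)
predicted = 0.6525 * n - 0.45
print(measured, predicted)
```

The two values differ by about one hop, which comes from the choice of percentile convention rather than from the linear dependence on n.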

Next, we look at the maximum effective radius versus the average effective radius of each
component in Figure 3.1 (b). We have the following observation.

Observation 3.1.1 (Chain-like Next-largest Connected Components) Next-largest connected
components have a relatively small MER (Maximum Effective Radius) to AER (Average Effective
Radius) ratio. Only the giant connected component has a high ratio.

The high ratio of maximum effective radius to average effective radius in the GCC can be
explained by its thick cores, which contain the majority of nodes, and its several tendrils [Broder
et al., 2000]. The thick cores decrease the average effective radius, while the tendrils increase the
maximum radius. In contrast, NLCCs of the YahooWeb graph do not have enough
nodes to form thick cores that could decrease the average effective radius. In terms
of the MER vs. AER ratio, the NLCCs are similar to chain graphs.
3.1.2 Pattern DU-3: Diameter Plot and “Gelling” Point


Studying the effective diameter of the graphs, we notice that there is often a point in time when 
the diameter spikes. Before that point, the graph is more or less in an establishment period, 
typically consisting of a collection of small, disconnected components. This “gelling point”
seems to also be the time where the GCC “takes off”. After the gelling point, the graph obeys the 
expected rules, such as the densification power law; its diameter decreases or stabilizes; the giant 
connected component keeps growing, absorbing the vast majority of the newcomer nodes. 

Observation 3.1.2 (Gelling point) Real graphs exhibit a gelling point, at which the diameter 
spikes and (several) disconnected components gel into a giant component. 

We show diameter plots over time in Fig. 3.2 (a) and in the left columns of Fig. 3.3 and Fig. 3.4. In
most of these graphs, both unipartite and bipartite, there are clear early gelling points. For example,
in NIPS the diameter spikes at t = 8 years, which is a reasonable time for an academic community to
gel. In some networks, we only see one side of the spike, due to data construction (the nature of
trace-routes in Oregon and NetTraffic) or massive network size (Patent).



[Figure 3.2 panels: (c) N(t) vs. E(t); (d) GCC, CC2, and CC3 (log-lin)]


Figure 3.2: Properties of PostNet network. Notice that we experience an early gelling point at (a) 
(diameter versus time), stabilization/oscillation of the NLCC sizes in (b) (size of 2nd 
and 3rd CC, versus time). The vertical line marks the gelling point. Part (c) gives 
N(t ) vs E(t) in log-log scales - the good linear fit agrees with the Densification 
Power Law. Part (d): component size (in log), vs time - the GCC is included, and it 
clearly dominates the rest, after the gelling point. 
Figure 3.3: Properties of other unipartite networks. Diameter plot (left column), and NLCCs
over time (right); vertical line marks the gelling point. Datasets from top to bottom: 
Patent, Arxiv, NIPS, BlogNet, NetTraffic. All datasets exhibit an early gelling point, 
and stabilization of the NLCCs. 
Figure 3.4: Properties of bipartite networks. Diameter plot (left column), and NLCCs over time
(right), with vertical line marking the gelling point. Datasets from top to bottom:
IMDB, CampOrg, CampIndiv, Netflix. Again, all datasets exhibit an early gelling
point, and stabilization of the NLCCs. Netflix has strange behavior because it is
masked (see the data description in Section 2.4 and the text).
3.1.3 Pattern DU-4: Constant/Oscillating NLCCs


How do the next-largest connected components evolve, especially after the gelling point? We
particularly study the second and the third largest connected components over time. We notice
that, after the gelling point, the sizes of these components oscillate over time. Further
investigation shows that the oscillation may be explained as follows: newcomer nodes typically link to
the GCC; very few of the newcomers link to the 2nd (or 3rd) CC, helping them to grow slowly;
in very rare cases, a newcomer links to both an NLCC and the GCC, thus leading to the
absorption of the NLCC into the GCC. It is exactly at these times that we have a drop in the size
of the 2nd CC. Note that edges are not removed; thus, what is reported as the size of the 2nd CC
is actually the size of yesterday’s 3rd CC, causing the apparent “oscillation”.
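
The absorption mechanism described above can be reproduced on a toy edge stream by tracking component sizes with a union-find structure. This is a minimal sketch: the class and function names are hypothetical, the edge stream is made up, and re-sorting all component sizes after every edge is done only for brevity.

```python
class UnionFind:
    """Union-find over arbitrary hashable node ids, tracking component sizes."""

    def __init__(self):
        self.parent = {}
        self.size = {}                    # root -> component size

    def find(self, x):
        if x not in self.parent:          # first time we see this node
            self.parent[x] = x
            self.size[x] = 1
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]   # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra              # attach the smaller tree to the larger
        self.size[ra] += self.size.pop(rb)


def top_component_sizes(edge_stream, k=3):
    """Sizes of the k largest components after each edge arrival (edges are never deleted)."""
    uf = UnionFind()
    history = []
    for u, v in edge_stream:
        uf.union(u, v)
        sizes = sorted(uf.size.values(), reverse=True)[:k]
        history.append(tuple(sizes + [0] * (k - len(sizes))))
    return history


stream = [(1, 2), (2, 3), (3, 4), (4, 5),   # GCC of 5 nodes
          (6, 7), (7, 8),                   # 2nd CC of 3 nodes
          (9, 10),                          # 3rd CC of 2 nodes
          (11, 6),                          # newcomer 11 joins the 2nd CC
          (11, 1)]                          # ...and then links to the GCC
history = top_component_sizes(stream)
print(history[-2], history[-1])
```

After the last edge the former 2nd CC has merged into the GCC, and the size reported for the "2nd CC" drops to that of the former 3rd CC, which is the apparent oscillation.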

An unexpected (to us, at least) observation is that the largest size these components can get seems 
to be a constant. This is counter-intuitive - based on random graph theory, we would expect the 
size of the NLCCs to grow with increasing N [Bollobas, 2001]. Using scale-free arguments, we 
would expect the NLCCs to have size that would be a (small, but constant) fraction of the size of 
the GCC - to our surprise, this never happened, on any of the real graphs we analyzed. If some 
underlying growth does exist, it was small enough to be impossible to observe throughout the 
(often lengthy) time in our datasets. 

Observation 3.1.3 (Constant/Oscillating NLCCs) After the gelling point, the secondary and 
tertiary connected components remain of approximately constant size, with small oscillations. 

We show full results for PostNet in Fig. 3.2, including the diameter plot (Fig. 3.2 (a)), sizes of the 
NLCCs (Fig. 3.2 (b)), densification plot (Fig. 3.2 (c)), and the sizes of the three largest connected 
components in log-linear scale, to observe how the GCC dominates the others (Fig. 3.2 (d)). 
Results from other networks are similar, and are shown in condensed form for space (Fig. 3.3 for 
unipartite graphs, and Fig. 3.4 for bipartite graphs). 

The second columns of Fig. 3.3 and Fig. 3.4 show the NLCC sizes versus time. Notice that, after
the “gelling” point (marked with a vertical line), they all oscillate about a constant value (different
for each network). The only extreme cases are datasets with unusually high connectivity. For
example, Netflix has very small NLCCs. This may be explained by the fact that the dataset is masked,
omitting users with fewer than a hundred ratings (possibly to further protect the privacy of the
encrypted user-ids). Therefore, the graph has abnormally high connectivity. We note that Oregon
and NetTraffic have unusually high connectivity due to the nature of network traffic: it benefits
from having a single GCC, so NLCCs shrink to zero.


3.1.4 Pattern DU-5: Stable Fractal Dimension of Connected Components 

Do next-largest connected components have the same structural properties as the giant connected
component? We study the homogeneity of components in large networks. The main questions of
interest are the following: (a) How can we characterize the density of a connected component?
Can we have an intrinsic measure that is invariant to the growth of the connected component?
(b) Do connected components have the same densities?

For the study, we investigate the connected components of YahooWeb. The giant connected
component contains 690 million web pages, which is about 50% of the total pages. The second
and the third largest connected components contain 57,000 and 21,000 pages, respectively, which is
much smaller than the giant connected component. Excluding the isolated (one-node) connected
components, there are 2.6 million connected components. Our goal is to find interesting patterns
across these many connected components of different sizes to better understand the evolution of
networks.

We first characterize the density of connected components. Many different definitions can be 
given: the most intuitive one is arguably the ratio of the size (number of edges) and the order 
(number of nodes) of the graph. However, [Leskovec et al., 2005b] states that there is a power- 
law relationship between the size and the order of the whole graph, and the exponent remains 
constant over time. Therefore, scaling it by log is more natural and thus we propose the Graph 
Fractal Dimension (GFD) to characterize the “density” of a connected component. 

Definition 1 (Graph Fractal Dimension) The graph fractal dimension GFD(C) of a connected
component C is the ratio of the size and the order in log scale. That is,
$GFD(C) = \frac{\log |E(C)|}{\log |V(C)|}$.

For example, a clique will have GFD ≈ 2, because there are $n^2 - n$ edges for n nodes. A chain
will have GFD ≈ 1, because there are $n - 1$ edges for n nodes. Several graphs and their fractal
dimensions are shown in Figure 3.5.
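
The definition can be tried on small example graphs. The sketch below makes two assumptions that are not stated explicitly in the text: each undirected edge is counted twice (as the number of nonzero entries of the symmetric adjacency matrix, consistent with the $n^2 - n$ count for a clique), and the example graphs have 8 nodes. The function name `gfd` is hypothetical.

```python
import math


def gfd(num_nodes, undirected_edges):
    """Graph fractal dimension: log(size) / log(order).

    Size is taken as the number of nonzeros in the symmetric adjacency
    matrix, i.e. twice the number of undirected edges (an assumption).
    """
    nonzeros = 2 * len(undirected_edges)
    return math.log(nonzeros) / math.log(num_nodes)


n = 8
chain = [(i, i + 1) for i in range(n - 1)]
star = [(0, i) for i in range(1, n)]
clique = [(i, j) for i in range(n) for j in range(i + 1, n)]

for name, edges in (("chain", chain), ("star", star), ("clique", clique)):
    print(name, round(gfd(n, edges), 2))
```

Under these assumptions the chain and the star have the same GFD because both have n - 1 edges, while the clique approaches 2 as n grows.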



(a) Chain: GFD 1.27    (b) Star: GFD 1.27    (c) Bipartite-Core: GFD 1.64    (d) Clique: GFD 1.94


Figure 3.5: Graphs, adjacency matrices and their fractal dimensions. Notice that the GFD can 
be used as a measure of density of graphs: chain or star graphs have smaller GFDs 
while cliques have higher GFDs. 


Graph Fractal Dimension 

How are the graph fractal dimensions of connected components distributed? Do they share
regularities? Figure 3.6 shows the graph fractal dimension of connected components in the YahooWeb
graph. In Figure 3.6 (a), there exist various components with a wide range of number of edges for

Figure 3.6: Homogeneity in the fractal dimension of components in YahooWeb. (a) Number of
edges vs. number of nodes. Each point corresponds to a connected component. (b)
Average number of edges of components (y-axis) with the corresponding number of
nodes (x-axis). Notice the fractal dimension of components fits in a line.


a fixed number of nodes, between the minimum (tree) and the maximum (clique). However, after 
averaging the number of edges in Figure 3.6 (b), we observe the following striking pattern. 

Observation 3.1.4 (Homogeneity of Components GFD) Graph fractal dimensions of connected 
components in YahooWeb graph are constant on average. 

This observation implies that, on average, the connected components of the Web graph are
self-similar, regardless of the size of the network.


Graph Fractal Dimension over time 

Here we broaden our focus to time-evolving graphs and study the dynamic aspects of the
connected components. The main questions are: Will the connected components grow at the same
rate? Will there be a change of growth rate around the “gelling” point where the diameter of the
graph starts to shrink?

To answer the above questions, we look at the graph fractal dimension of the 3 largest connected
components over time in Figure 3.7 and summarize the findings in the following observation.

Observation 3.1.5 (Evolution of Top 3 Connected Components) The giant connected
component and the two next-largest connected components grow at the same rate. Their graph fractal
dimensions remain the same until a deviation point. The deviation point is close to the “gelling”
point where the diameter starts to shrink.

This observation is interesting, since it implies that some barriers between the nodes seem to 
collapse after the gelling point, and the nodes in the network are connected with higher rate than 
before the gelling point. 
Figure 3.7: Growth of connected components in terms of the graph fractal dimension. Each
point represents the snapshot of a connected component over time. Notice that the 
slope remains constant until a “deviation” point (the second vertical line) close to a 
“gelling” point (the first vertical line), and starts to increase after that. The deviation 
points are about one year after the gelling points. 


3.1.5 Pattern DU-6: Exponential “Rebel” Probability to NLCCs 


Given a newcomer to a network, what is the probability that it will not be absorbed into the GCC,
i.e., that it will join an NLCC instead? We call this the “rebel” probability and relate it to the
degree of newcomers and the portion of nodes in the NLCCs.


We first give the relationship between the rebel probability and the degree of newcomers in Figure 3.8.
In the figure, we see that the probability is linear in the degree in log-lin scale, where the slope
decreases as the network grows. In addition, we show the relationship between the rebel probability
and the portion of nodes in the NLCC in Figure 3.9. From Figure 3.9, we see that the probability
is linear in the portion of nodes in the NLCC in log-log scale, and the slope increases as the degree
increases. Given these two observations, we give the empirical rebel probability of newcomers as a
function of the degree (d) and the portion (s) of nodes in the NLCC in the following observation,
which we call the ERP (Exponential Rebel Probability) pattern.


Observation 3.1.6 (Exponential Rebel Probability (ERP)) Given the node portion s of NLCCs,
the probability $P_{rebel}$ of a newcomer to be absorbed in NLCCs is exponential to the product
of a constant a, the degree d of the newcomer, and the log of the node portion s of NLCCs:


$P_{rebel} \propto e^{a d \log(s)}$    (3.1)
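
Taking logarithms of the ERP form shows how it ties together the two plots discussed above. The step below is a sketch in which c stands for an unspecified proportionality constant, and it assumes a > 0 and 0 < s < 1:

```latex
P_{rebel} \propto e^{a d \log s} = s^{a d}
\quad\Longrightarrow\quad
\log P_{rebel} = c + (a \log s)\, d = c + (a d) \log s
```

For a fixed portion s, $\log P_{rebel}$ is linear in the degree d with slope $a \log s$, which is negative under these assumptions (the linear drop in log-lin scale). For a fixed degree d, $\log P_{rebel}$ is linear in $\log s$ with slope $a d$, which grows with the degree (the steeper lines in log-log scale).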


3.1.6 Pattern DU-7: Principal Eigenvalue over time 

How does the principal eigenvalue of a graph behave over time? The principal eigenvalue is 
the largest eigenvalue associated with the adjacency matrix of the graph, and is also described 
as a measure of overall connection strength of a connectivity matrix. It has also been shown 
that the epidemic threshold in graphs depends only on this principal eigenvalue, and nothing 
else [Prakash et al., 2010; Valler et al., 2011]. 
Figure 3.8: P(Absorption to NLCCs) vs. Degree in log-lin scale. Notice the linear drop of the
probability as the degree increases.

[Figure 3.9 panels: (a) Patent, (b) HEP-TH, (c) HEP-PH; axis label: % |V| in DC]

Figure 3.9: P(Absorption to NLCCs) vs. Portion of Nodes in NLCCs in log-log scale. Notice
that the slopes of curves increase as degree increases.


Plotting the largest (principal) eigenvalue of the 0-1 adjacency matrix of our datasets over time,
we notice that the principal eigenvalue grows following a power law with increasing number of
edges. This observation is true especially after the gelling point. As discussed before, the gelling
point is defined as the point at which a giant connected component (GCC) appears in real-world
graphs; after this point, properties such as densification and shrinking diameter become
increasingly evident.

Observation 3.1.7 ($\lambda_1$ Power Law (LPL)) In real graphs, the principal eigenvalue $\lambda_1(t)$ and
the number of edges E(t) over time follow a power law with exponent less than 0.5, especially
after the “gelling” point. That is,

$\lambda_1(t) \propto E(t)^{\alpha}, \quad \alpha < 0.5$


We report the power law exponents in Fig. 3.10. Note that we fit the given lines after the gelling 
point which is shown by a vertical line for each dataset. Notice that the slopes are less than 0.5, 
with the exception of the CampOrg dataset, which has slope ~ 0.53. 
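A power-law exponent such as the one reported here is the slope of a straight-line fit in log-log scale. The following is a minimal sketch of such a fit using ordinary least squares on the logarithms; the function name and the synthetic series are assumptions for illustration and do not reproduce the thesis data or its exact fitting procedure. In practice one would pass only the points after the gelling point.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of log(y) = alpha * log(x) + log(c); returns (alpha, c)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((u - mx) ** 2 for u in lx)
    sxy = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    alpha = sxy / sxx
    return alpha, math.exp(my - alpha * mx)

# Synthetic (not measured) series that obeys lambda_1 = 2 * E^0.45 exactly.
edges = [10 ** k for k in range(2, 7)]
lam1 = [2.0 * e ** 0.45 for e in edges]
alpha, c = fit_power_law(edges, lam1)
```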

[Figure 3.10 panels: Auth-Key, NetTraffic, Auth-Conf, BlogNet]

Figure 3.10: Illustration of the LPL (Observation 3.1.7). 1st largest (principal) eigenvalue $\lambda_1(t)$
of the 0-1 adjacency matrix A versus number of edges E(t) over time. The vertical
lines indicate the gelling point.

Given the theorem $\lambda_{max}(G) \le \{2 (1 - \frac{1}{N}) E\}^{1/2}$ for a connected, undirected graph $G$ without
self-loops and multiple edges, with E edges and N nodes (see [Wilf, 1967] for proof), for large
N, s.t. $\frac{1}{N} \to 0$, we expect the power law exponent to be less than 0.5. By construction, there
are no multiple edges in our graphs (that is, we work with a binary adjacency matrix), and the
nodes do not have self-loops by nature. Finally, we claim that the graphs behave as a single
connected component after the gelling point, at which point the GCC dominates the other connected
components. The only slight exception to the power law, the CampOrg graph, always has a large
number of disconnected components as well as a GCC. Thus, we conclude that our observation
agrees with the earlier theory.
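The bound attributed to [Wilf, 1967] can be checked numerically on small graphs. The sketch below is an illustration under stated assumptions, not the procedure used in the thesis: it estimates the principal eigenvalue with power iteration on A + I (the shift is a choice made here so that the iteration also converges on bipartite graphs), and compares it with sqrt(2(1 - 1/N)E). Function names and the toy graphs are hypothetical.

```python
import math

def principal_eigenvalue(n, edges, iters=500):
    """Largest adjacency eigenvalue of a connected simple undirected graph,
    estimated by power iteration on A + I followed by a Rayleigh quotient."""
    nbrs = [[] for _ in range(n)]
    for u, v in edges:
        nbrs[u].append(v)
        nbrs[v].append(u)
    x = [1.0] * n
    for _ in range(iters):
        y = [x[u] + sum(x[v] for v in nbrs[u]) for u in range(n)]  # (A + I) x
        norm = math.sqrt(sum(t * t for t in y))
        x = [t / norm for t in y]
    ax = [sum(x[v] for v in nbrs[u]) for u in range(n)]  # A x
    return sum(p * q for p, q in zip(x, ax))  # x has unit length

def wilf_bound(n, m):
    """Upper bound sqrt(2 * (1 - 1/N) * E) for N nodes and E edges."""
    return math.sqrt(2.0 * (1.0 - 1.0 / n) * m)

# Complete graph K4: the eigenvalue is 3 and the bound is met with equality.
k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
# Path on 4 nodes: the eigenvalue is 2*cos(pi/5), well below the bound.
path = [(0, 1), (1, 2), (2, 3)]
lam_k4 = principal_eigenvalue(4, k4)
lam_path = principal_eigenvalue(4, path)
```

Because the bound grows like the square root of E for large N, a principal eigenvalue that tracks the bound would have a power-law exponent of at most 0.5 against the number of edges.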


3.2 Weighted graph patterns 

3.2.1 Pattern SW-1: Edge Weights Power Law (EWPL) 

We observe that the weight of a given edge and the weights of its two incident nodes are
correlated. Our observation is similar to Newton’s Gravitational Law stating that the gravitational
force between two point masses is proportional to the product of the masses. Similarly, the
tendency of two nodes to interact often would be related to the “popularity” of both.

For each edge (i,j) at the final time step in the graph, we plot $\sqrt{(w_i - w_{i,j}) \cdot (w_j - w_{i,j})}$ versus
its weight $w_{i,j}$ as a single point. Notice that we did not include the weight of the edge itself
in the total weight of its incident nodes. Next, we fit a line to the median y-axis values after
applying logarithmic binning on the x-axis and report the corresponding slopes for each dataset
in Fig. 3.11. Note that we omit the points which represent edges to avoid confusion due to
overplotting. Instead, we show the 75th and 25th percentiles of the data with upper and lower
vertical bars, respectively.

Observation 3.2.1 (Edge Weights Power Law (EWPL)) Given a real-world graph $G$,
‘communication’, defined as the weight of the link between two given nodes, has a power law relation
with the weights of the nodes. In particular, given an edge $e_{i,j}$ with weight $w_{i,j}$ and its two
neighbor nodes i and j with weights $w_i$ and $w_j$, respectively,

$w_{i,j} \propto \left( \sqrt{(w_i - w_{i,j}) \cdot (w_j - w_{i,j})} \right)^{\gamma}$


EWPL can be used in link prediction; that is, one can estimate the probable weight of a future link 
between two nodes of the graph, given their weights. Moreover, edges with weights deviating 
too much from the expected might be flagged for further consideration. 
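The quantity plotted for the EWPL can be computed per edge from node weight totals. The following is a minimal sketch, assuming an undirected weighted edge list in which each edge appears once; the function name and toy data are hypothetical. For every edge it returns the edge weight together with the square root of the product of the two endpoint weights, each taken with the edge's own weight excluded.

```python
import math
from collections import defaultdict

def ewpl_points(weighted_edges):
    """For each undirected edge (i, j, w_ij) return
    (w_ij, sqrt((w_i - w_ij) * (w_j - w_ij))), where w_i is the total
    weight of all edges incident to node i."""
    node_weight = defaultdict(float)
    for i, j, w in weighted_edges:
        node_weight[i] += w
        node_weight[j] += w
    points = []
    for i, j, w in weighted_edges:
        rest_i = node_weight[i] - w  # weight of i's other edges
        rest_j = node_weight[j] - w  # weight of j's other edges
        points.append((w, math.sqrt(rest_i * rest_j)))
    return points

# Toy edge list (illustrative only).
toy_edges = [("a", "b", 4), ("a", "c", 9), ("b", "c", 1), ("c", "d", 2)]
toy_points = ewpl_points(toy_edges)
```

Edges whose endpoint has no other edges yield zero and cannot be placed on a logarithmic axis, so they would have to be dropped before fitting.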




[Figure 3.11 panels: NetTraffic, Auth-Key, Auth-Conf, Key-Conf; x-axis: $w_{i,j}$]

Figure 3.11: Illustration of the EWPL (Observation 3.2.1). Given the weight of a particular edge
in the final snapshot of real graphs (x-axis), the multiplication of total weights
(y-axis) of the edges incident to two neighboring nodes follows a power law. A line
can be fit to the median values after logarithmic binning on the x-axis. Upper and
lower bars indicate the 75th and 25th percentiles of the data, respectively.
3.2.2 Pattern SW-2: Snapshot Power Law (SPL)


Is there any correlation between the in-degree and the in-weight, and between the out-degree and
the out-weight, for all the nodes of a graph, at a given time-stamp? If node i has out-degree $out_i$,
what can we say about its out-weight $outw_i$? We find that there is a “fortification effect” here,
i.e. superlinear growth, resulting in more power laws, both for out-degrees/out-weights as well
as for in-degrees/in-weights.

Specifically, for a given time snapshot, we plot the in/out weight versus the in/out degree for
all the nodes in the graph as a scatter plot. An example of such a
plot is in Fig. 3.12 (a) and (b). Here, every point represents a node and the x and y coordinates
are its degree and total weight, respectively. To achieve a good fit, we bucketize the x-axis with
logarithmic binning [Newman, 2005], and, for each bin, we compute the median y. We observe
that the median values of weights versus mid-points of the intervals follow a power law for all
datasets studied. Formally, the “Snapshot Power Law” is:

Observation 3.2.2 (Snapshot Power Law (SPL)) Consider the i-th node of a weighted graph,
at time t, and let $out_i$, $outw_i$ be its out-degree and out-weight. Then

$outw_i \propto out_i^{ow}$

where ow is the out-weight-exponent of the SPL. Similarly, for the in-degree, with
in-weight-exponent iw.
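The logarithmic binning step used before fitting the SPL can be sketched as follows. This is an illustration under assumptions: bins are powers of two, and each bin is represented by its geometric midpoint, which is one possible reading of "mid-points of the intervals". The function name and toy points are hypothetical.

```python
import math
from collections import defaultdict
from statistics import median

def log_binned_medians(points):
    """Group (x, y) points into bins [2**k, 2**(k+1)) on x and return
    (geometric bin midpoint, median y) pairs, ordered by bin."""
    bins = defaultdict(list)
    for x, y in points:
        k = math.floor(math.log2(x))  # index of the power-of-two bin
        bins[k].append(y)
    return [(2.0 ** (k + 0.5), median(ys)) for k, ys in sorted(bins.items())]

# Toy (degree, total weight) pairs, illustrative only.
toy_nodes = [(1, 1), (1, 3), (2, 4), (3, 6), (3, 10), (4, 20), (5, 30), (7, 40)]
binned = log_binned_medians(toy_nodes)
```

The resulting (midpoint, median) pairs are what a straight line would then be fit to in log-log scale to obtain the exponents iw and ow.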

We studied the snapshot plots for several time-stamps (for brevity, we only report the slopes for
the final timestamp in Table 3.1 for all the datasets we studied). We observed that the SPL exponents
of a graph remain almost constant over time. In Fig. 3.12 (a) and (b), the inset plots show how
the iw and ow exponents change over time (years) for the CampOrg dataset, respectively. We
notice that iw and ow take values in the ranges [0.9-1.2] and [0.95-1.35], respectively:

Observation 3.2.3 (Persistence of Snapshot Power Law) The in- and out-exponents iw and
ow of the SPL remain about constant over time.

Looking at Table 3.1, we observe that all SPL exponents are > 1, which implies a “fortification
effect” with super-linear growth. The only exception is the NetTraffic dataset. This is explained
by the fact that the number of nodes N has a limit that cannot be exceeded (the total number of IP
addresses at the institution of observation). Until N reaches that point, the slopes are iw = 1.19
and ow = 1.27 (again, showing a “fortification effect”).


3.2.3 Pattern DW-1: Weight Power Law (WPL) 

Let W(t) be the total weight up to time t (e.g., the grand total of all exchanged packets in a
network), E(t) the number of distinct edges up to time t, and $E_d(t)$ the number of multi-edges
(the d subscript stands for duplicate edges), up to time t. Is there any correlation between the
total weight, the total number of edges and the total number of multi-edges in a graph, as they
change over time?
[Figure 3.12 panels: (a) SPL (in-weight vs. in-degree); (b) SPL (out-weight vs. out-degree);
(c) WPL; (d) Committee-to-Candidate entropy plot]

Figure 3.12: Weight properties of CampOrg donations: SPL plots (a) and (b) have slopes
(power-law exponents) > 1 (“fortification effect”), that is, the more campaigns
an organization supports, the superlinearly more money it donates, and similarly,
the more donations a candidate gets, the higher the average amount-per-donation
received. Inset plots show the exponents iw and ow over time, which stay quite
stable; (c) shows all the power laws as well as the WPL; the entropy plot in (d) has
slope ≈ 0.86, indicating bursty weight additions over time.


If every pair generated k packets, the relationships would be linear: if the count of pairs doubles,
the packet count would double, too. This is reasonable, but it doesn’t happen! In reality, the
packet count over-doubles, following the “WPL” below. In other words, we observe a
“fortification effect” here, too: more edges in the graph imply super-linearly higher total weight.

Observation 3.2.4 (Weight Power Law (WPL)) Let E(t), W(t) be the number of edges and
total weight of a graph, at time t. Then, they follow a power law

$W(t) = E(t)^{w}$

where w is the weight exponent. Power laws also link the number of nodes N(t), and the number
of multi-edges $E_d(t)$, to E(t), with exponents n and dupE, respectively.
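The cumulative quantities behind the WPL can be tracked from a time-ordered stream of interaction records. The sketch below is a hedged illustration: the record layout (source, destination, weight) and the function name are assumptions, and the multi-edge count is taken here to mean every occurrence of an edge including the first, which is only one possible reading of the definition of the number of multi-edges.

```python
def cumulative_counts(stream):
    """stream: time-ordered (src, dst, weight) records.
    Returns one (N, E, E_d, W) tuple per record: nodes seen so far, distinct
    edges, edge occurrences counted with multiplicity, and total weight."""
    nodes, distinct = set(), set()
    occurrences, total_weight = 0, 0.0
    history = []
    for src, dst, weight in stream:
        nodes.update((src, dst))
        distinct.add((src, dst))
        occurrences += 1
        total_weight += weight
        history.append((len(nodes), len(distinct), occurrences, total_weight))
    return history

# Toy records (illustrative only).
records = [("a", "b", 5.0), ("a", "b", 2.0), ("b", "c", 1.0), ("a", "b", 4.0)]
history = cumulative_counts(records)
```

Plotting the W, E_d, and N columns against the E column in log-log scale gives the curves whose slopes are the exponents w, dupE, and n.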
Dataset      w      nsrc   ndst   dupE   iw     ow     fd
CampOrg      1.53   0.58   0.73   1.29   1.16   1.30   0.86
CampIndiv    1.36   0.53   0.92   1.14   1.05   1.48   0.87
NetTraffic   1.48   0.57   N/A    1.29   0.66   0.45   0.59
BlogNet      1.03   0.79   N/A    N/A    1.01   1.10   0.96
Auth-Key     1.01   0.90   0.70   N/A    1.01   1.04   0.95
Auth-Conf    1.08   0.96   0.48   N/A    1.04   1.81   0.96
Key-Conf     1.22   0.85   0.54   N/A    1.26   2.14   0.95


Table 3.1: Power law exponents for all the weighted datasets we studied, with the number of
non-duplicate edges E on the x-axis. w: WPL exponent; nsrc, ndst: WPL exponents for
source and destination nodes, respectively (if the graph is unipartite, then nsrc refers
to all nodes); dupE: exponent for multi-edges; iw, ow: SPL exponents for
indegree and outdegree of nodes, respectively. Exponents above 1 indicate
fortification/superlinear growth. Last column, fd: slope of the entropy plots, or information
fractal dimension. Lower fd means more burstiness.


The weight exponent w ranges from 1.01 to 1.53 for the real graphs we have studied. The highest
value corresponds to campaign donations: super-active organizations that support many
campaigns also tend to spend even more money per campaign than the less active organizations. For
bipartite graphs, we show the nsrc, ndst exponents for the source and destination nodes (which
also follow power laws: $N_{src}(t) = E(t)^{nsrc}$ and similarly for $N_{dst}(t)$).

Fig. 3.12 (c) shows the WPL for our example dataset CampOrg. Other datasets are shown in 
Fig. 3.13. The plots are in log-log scales. We report the slopes in Table 3.1. 


3.2.4 Pattern DW-2: Bursty/Self-similar weight additions 

We tracked how much weight a graph puts on at each time interval and looking at the entropy 
plots, we observed that the weight additions over time show self-similarity. For those weighted 
graphs where the edge weight is defined as the number of recurrences of that edge, the slope 
of the entropy plot was greater than 0.95, pointing out uniformity. On the other hand, for those 
graphs where weight is not in terms of multiple edges but some other feature of the dataset such 
as the amount of donations for the FEC dataset, we observed that weight additions are more 
bursty, the slope being as low as 0.6 for the Network Traffic dataset. 

Observation 3.2.5 (Bursty/self-similar weight additions) In all our graphs, the addition of
weight ($\Delta W(t)$) was self-similar, with fractal dimension ranging from ≈ 1 (smooth/uniform)
down to 0.6 (bursty).

Fig. 3.12 (d) shows the entropy plot for our example dataset CampOrg. Other datasets are shown
in Fig. 3.14. $\Delta W$ values over time are also shown in insets at the bottom right corner of each
figure. We report the information fractal dimensions (= slopes in entropy plots) in Table 3.1.
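The thesis does not spell out how an entropy plot is built, so the sketch below assumes the common construction: the time axis is split into 2^r equal intervals for each resolution r, the Shannon entropy of the weight fractions per interval is computed, and the slope of entropy versus r is taken as the information fractal dimension. Function names and the toy series are hypothetical.

```python
import math

def entropy_plot(series):
    """series: non-negative weight additions over 2**n equal time ticks.
    Returns [(r, H_r)], where H_r is the Shannon entropy in bits of the
    weight fractions after aggregating the series into 2**r intervals."""
    n = int(math.log2(len(series)))
    if 2 ** n != len(series):
        raise ValueError("series length must be a power of two")
    total = float(sum(series))
    points = []
    for r in range(n + 1):
        width = len(series) // 2 ** r
        h = 0.0
        for i in range(2 ** r):
            p = sum(series[i * width:(i + 1) * width]) / total
            if p > 0:
                h -= p * math.log2(p)
        points.append((r, h))
    return points

def slope(points):
    """Least-squares slope of H_r versus r."""
    n = len(points)
    mx = sum(r for r, _ in points) / n
    my = sum(h for _, h in points) / n
    num = sum((r - mx) * (h - my) for r, h in points)
    den = sum((r - mx) ** 2 for r, _ in points)
    return num / den

uniform_fd = slope(entropy_plot([1.0] * 8))          # weight spread evenly
burst_fd = slope(entropy_plot([8.0] + [0.0] * 7))    # all weight in one tick
```

Under this construction a perfectly uniform series has slope 1 and a single burst has slope 0, which matches the reading that lower fd means more burstiness.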
[Figure 3.13 panels: CampIndiv, NetTraffic, BlogNet]

Figure 3.13: Properties of weighted networks. Weight power laws: total weight W, number
of duplicate edges $E_d$, number of nodes N; each versus number of non-duplicate
edges E. The slopes for weight W and multi-edges $E_d$ are above 1, indicating
“fortification”.


3.2.5 Pattern DW-3: Weighted principal eigenvalue over time 

Given that unweighted (0-1) graphs follow the $\lambda_1$ Power Law, one may ask if there is a
corresponding law for weighted graphs. To this end, we also compute the largest eigenvalue $\lambda_{1,w}$ of
the weighted adjacency matrix $A_w$. The entries $w_{i,j}$ of $A_w$ now represent the actual edge weight
between nodes i and j. We notice that $\lambda_{1,w}$ increases with increasing number of edges following
a power law with a higher exponent than that of the $\lambda_1$ Power Law. We show the experimental
results in Fig. 3.15.

Observation 3.2.6 ($\lambda_{1,w}$ Power Law (LWPL)) Weighted real graphs exhibit a power law for
the largest eigenvalue of the weighted adjacency matrix $\lambda_{1,w}(t)$ and the number of edges E(t)
over time. That is,

$\lambda_{1,w}(t) \propto E(t)^{\beta}$

In our experiments, the exponent $\beta$ ranged from 0.5 to 1.6.
[Figure 3.14 panels: CampIndiv, NetTraffic, BlogNet, Auth-Key, Auth-Conf, Key-Conf]

Figure 3.14: Properties of weighted networks. Entropy plots for weight addition. Slope away
from 1 indicates burstiness (e.g., 0.59 for NetTraffic). The inset plots show the
corresponding time sequence $\Delta W$ versus time. Notice how bursty NetTraffic looks.

[Figure 3.15 panels: CampOrg, NetTraffic, BlogNet, Auth-Conf, Key-Conf]

Figure 3.15: Illustration of the LWPL (Observation 3.2.6). 1st largest (principal) eigenvalue
$\lambda_{1,w}(t)$ of the weighted adjacency matrix $A_w$ versus number of edges E(t) over
time. The vertical lines indicate the gelling point.
3.3 Summary of contributions


The vast majority of previous work on the study of networks focused on static networks, with
findings such as heavy-tailed degree distributions and small diameter. In this chapter, we
described two novel aspects of this thesis in the study of real-world networks: (1) dynamic networks
and (2) weighted networks, with interesting new findings (see Table 2.2 for an overview).

More specifically, while previous work focused on the single giant connected component of 
networks, we shifted our focus to the next-largest connected components and their formation 
and evolution. We analyzed the structural properties of these smaller components such as their 
intrinsic dimension, as well as their temporal properties such as how they grow and get merged 
to the giant component over time. We further studied the behavior of the newcomer nodes and 
formulated how their microscopic dynamics (“rebel” probability) explain the properties observed 
on the macroscopic level (components oscillating around constant-size). 

Moreover, we started the study of weighted networks where the recurrences of interactions (e.g.
multiple emails) or natural weights associated with interactions (e.g. dollar amount donations)
are taken into account. We analyzed the correlation of edge weights with other network
properties as well as the additions of weights to the network over time.

In summary, our contributions described in this chapter are the following: 

• Study of numerous real-world datasets: We studied numerous real-world networks from 
many diverse domains including political campaign donations, computer network traffic, 
blog citations, social and Web networks. We discovered several new patterns in dynamic 
and weighted graphs. 

• “Rebel probability” and other dynamic component patterns: We focused on the rarely-studied
next-largest connected components (NLCCs) and their dynamics. In particular, we
showed that (1) NLCCs look like chain graphs; (2) real graphs exhibit a “gelling” point at
which the small components merge and a giant connected component (GCC) emerges; (3)
after the gelling point, secondary and tertiary components cannot grow beyond a certain
size, beyond which they get absorbed into the GCC; (4) graph fractal dimension of
components is stable over time; and (5) rebel probability (that a newcomer node will not join the
GCC) drops exponentially with its degree and increases with the fraction-size of NLCCs
(linearly in log-log scale).

• “Fortification effect” and other weighted network patterns: We focused on the previously
unstudied weights on graph edges and their dynamics. In particular, we showed that (1) total weight
of a node and its degree follow power laws, and weight between two nodes follows a
gravitational-force-like relation with the total weights of those nodes; (2) total weight of a
graph grows superlinearly with the number of its edges over time; (3) weight additions over
time are bursty; and (4) principal eigenvalues and number of edges of a graph follow a
power law relation over time.
Chapter 4

Patterns in human communications


PROBLEM STATEMENT: What are the typical patterns in human communications? How large
are our social circles? How reciprocal are we, and what does it depend on? How long are our
phone calls, and how can we summarize their distribution?


In this chapter, we present our findings in millions of communication records, including phone
calls, short messages, and instant messages, of millions of users. In particular we study the social
circles (= maximal cliques) people belong to, their reciprocity behavior, as well as call durations.


4.1 Patterns in social circles 

4.1.1 Motivation 

What patterns should we expect in a network of human-to-human interactions? How can we spot
anomalies (e.g., tele-marketers, spammers)? Related applications are numerous and almost
everywhere in people’s modern life. Online social networks, like Facebook (www.facebook.com)
and LinkedIn (www.linkedin.com), publicly mirror the telecommunication networks through
which people communicate privately. Product recommendation systems, such as Amazon
(www.amazon.com) and Netflix (www.netflix.com), rely on a network of trust and
collaboration. Computer networks have predictable relations regarding intrusion detection, security,
and virus propagation. It is important in all the above applications to spot anomalies and outliers.
Anomaly detection is tightly connected to patterns: if most of the nodes in our network closely
follow a pattern, then the few deviations that do exist are probably outliers.

In this section, we are investigating the following questions: 

• When we isolate the cliques in a network, what patterns do they follow? How large are our
social circles on average? If someone has many contacts, does that indicate popularity?

• What patterns do the edge weights follow, both in triangles and in general cliques?
Specifically, in a triangle, all three nodes are equivalent in topology, but is it normal if all three
weights are equal as well?


4.1.2 Data Description 

The datasets analyzed are built from a large collection of records from several human
communication services including voice, data, IM, SMS etc. Each record is represented as a triple
$\langle ID_i, ID_j, Time \rangle$, where $ID_i$ and $ID_j$ are generally referred to as the caller
and callee. During a particular time period, a pair of people can communicate with each other
multiple times, and the accumulated number of communications between
$ID_i$ and $ID_j$ is defined as the edge weight between nodes i and j. We have the weighted graphs
extracted from the records of three types of services (S1, S2 and S3), referred to as $G^{S1}$, $G^{S2}$,
and $G^{S3}$ respectively. Each type of service has on average about 1 million records, which were
collected at different geographic locations. Apart from this spatial diversity and the service type
variety, we also incorporate temporal diversity by collecting data for each type of service during
five consecutive time periods, denoted T1 to T5, so $G^{S1}_{T1}$ is the graph of service type S1
in time period T1, and $G^{S2}_{T5}$ is the graph of service type S2 in time period T5, etc.

Notice that we only focus on the link between the caller and callee. It is important to know that
our work is only an aggregate statistical analysis, and therefore, we do not study any individual’s
behavior from any specific type of communication service. More importantly, any information
that could identify users has been stripped. We only use the encrypted user id in this study,
and restrict our interest to the statistical findings that hold within the networks.

We note that unlike some artificial social networks, such as the scientific collaboration network 
which emerges as a one-mode projection of the bipartite graph between authors and papers, 
the massive anonymized human communication networks are formed from the real-time direct 
contact events of people. They can fully capture the underlying realistic social structures, and 
lay a solid foundation for our upcoming work. 


4.1.3 Patterns and Observations 

We report three newly discovered patterns that our datasets seem to follow, and discuss the
potential ways in which they can be utilized. The first is the Clique-Degree Power-Law (CDPL),
correlating the $i$th largest degree with the average number of maximal cliques, which seems to remain
rather stable over time so that we can rely on it to detect outliers and spot anomalies. The
second is the Clique Participation Law (CPL), which gives the distribution of the number of
maximal cliques that each node participates in. Finally, the third is the Triangle Weight Law (TWL),
describing how the weights are distributed on the edges of triangles, based on which we could
further make predictions about the missing values of the edge weights in time-evolving weighted
networks.
Clique-Degree Power-Law


As defined previously, $\forall v_i \in V(G)$, $d_i$ is the number of all the partners that $v_i$ has, and $C(v_i)$
represents the set of all the maximal cliques that $v_i$ participates in. Is there any relationship
between $d_i$ and $|C(v_i)|$? We can imagine that if a particular user doubles his partners, he
tends to participate in twice as many social circles as well. Such a linear relationship
sounds reasonable. However, this is often not the case. In our real-world
social networks, the number of social circles actually over-doubles, following a Clique-Degree
Power-Law.

Figure 4.1 plots the number of partners vs. the number of maximal cliques averaged over all the
nodes with that many partners, from T1 to T5. The result is surprising because for any given
node, the clique-participation is super-linearly related to its degree. In addition, we notice that
the exponent takes values in the range [1.84, 1.88], [2.04, 2.21] and [1.41, 1.58] for $G^{S1}$, $G^{S2}$, and
$G^{S3}$, which seems to be stable over time.

Observation 4.1.1 (Clique-Degree Power-Law (CDPL)) The number of maximal cliques that
a node participates in is super-linearly related to its degree. Given $d_i$ and $C^{d_i}_{avg}$, they follow a
power-law:

$C^{d_i}_{avg} \propto d_i^{\alpha}$    (4.1)

where $\alpha$ is the exponent of CDPL, and remains about constant over time.
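The quantity behind the CDPL, the average number of maximal cliques per degree, can be computed on a small graph as follows. This is a minimal sketch rather than the thesis's implementation: it assumes the graph fits in memory as an adjacency dictionary and uses the Bron-Kerbosch algorithm with pivoting to enumerate maximal cliques. Function names and the toy graph are hypothetical.

```python
from collections import defaultdict

def maximal_cliques(adj):
    """Bron-Kerbosch with pivoting; adj maps node -> set of neighbors."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(r)
            return
        pivot = max(p | x, key=lambda u: len(adj[u] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques

def avg_cliques_by_degree(adj):
    """Map degree d -> average number of maximal cliques over nodes of degree d."""
    count = defaultdict(int)
    for clique in maximal_cliques(adj):
        for v in clique:
            count[v] += 1
    by_degree = defaultdict(list)
    for v, nbrs in adj.items():
        by_degree[len(nbrs)].append(count[v])
    return {d: sum(c) / len(c) for d, c in sorted(by_degree.items())}

# Toy graph: a triangle a-b-c with two extra partners d and e attached to c.
toy_adj = {
    "a": {"b", "c"}, "b": {"a", "c"},
    "c": {"a", "b", "d", "e"}, "d": {"c"}, "e": {"c"},
}
cdpl_points = avg_cliques_by_degree(toy_adj)
```

Fitting a line to these (degree, average count) pairs in log-log scale would give an estimate of the exponent; nodes far from the fitted line are the outlier candidates.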

The direct application of CDPL is to spot outliers. In Figure 4.1, all of the detected anomalies
are marked by red circles. We can see that these points present a clear pattern which does not
conform to the established normal behavior. In other words, for these users the actual number
of maximal cliques that they belong to is significantly distant from the number that they should
have according to the number of their partners.

It is also interesting to notice that some outliers are stable and persistent, such as nodes $v_x$ and $v_y$,
while others are more casual and bursty, such as node $v_z$. Figure 4.2 shows the egocentric
subgraphs centered at nodes $v_y$ and $v_z$, which
are composed of the connections among their neighbors in T5. Clearly, for node $v_y$, although
it has a large number of partners, it only belongs to a few maximal cliques in the upper left part
of Figure 4.2 (a). As to node $v_z$ shown in Figure 4.2 (b), almost no connections exist among
its partners. Because any automatic customer service id is excluded from our communication
networks, the anomalous behavior of $v_y$ and $v_z$ makes them look like tele-marketers. In fact,
there are more outliers in the last time period T5 than in the others, especially in the network
$G^{S3}_{T5}$ of the third communication service. We conjecture that this is because there is a big holiday
in T5, and the third communication service is the cheapest for broadcasting messages.


Clique Participation Law 

Based on the discovered maximal cliques, we are able to study how people get involved in
them. Figure 4.3 shows the distribution of the number of maximal cliques that people actually
participate in. That is, for a graph $G$, it plots the number of maximal cliques
(x-axis) against the PDF of nodes (y-axis) that are involved in that many maximal cliques. We
observe that this relationship follows a power law, which we call the Clique
Participation Law.

[Figure 4.1 plots; y-axis: Average #Maximal Cliques]

Figure 4.1: Illustration of Clique-Degree Power-Law (Observation 4.1.1). Number of partners
vs. the average number of maximal cliques in $G^{S1} \sim G^{S3}$ (rows) from T1 to T5
(columns). All exponents are fitted with $R^2 > 0.95$. Notice that CDPL is very stable
over time. Outliers are marked by red circles.

[Figure 4.2 panels: (a) Centered with vertex $v_y$; (b) Centered with vertex $v_z$]

Figure 4.2: Detected typical outliers. Both $v_y$ and $v_z$ (from Figure 4.1) have too
many unrelated partners, resulting in a star-like subgraph.

Figure 4.3: Illustration of Clique-Participation Law (Observation 4.1.2). PDF of #Maximal
Cliques in $G^{S1}_{T1} \sim G^{S3}_{T1}$. The rest of the graphs behave similarly.

Observation 4.1.2 (Clique Participation Law (CPL)) For a given number of maximal cliques,
say $n_{clique}$, and the set $V_{clique} = \{v_i \mid v_i \in V(G), |C(v_i)| = n_{clique}\}$, we have

$|V_{clique}| \propto n_{clique}^{cp}$    (4.2)

where cp is the clique participation exponent of CPL, and remains about constant over time.



            T1      T2      T3      T4      T5
$G^{S1}$    -1.78   -1.74   -1.76   -1.70   -1.68
$G^{S2}$    -1.63   -1.56   -1.52   -1.56   -1.54
$G^{S3}$    -3.21   -3.50   -3.46   -3.50   -3.01


Table 4.1: Power-Law exponents of CPL in $G^{S1} \sim G^{S3}$ from T1 to T5. Notice the stability.
According to the above discussion, most people in real-world social networks are involved in
only a small number of maximal cliques (or social circles). Only a few of them are really
“social butterflies” that can actively span many social circles simultaneously. In Figure 4.3, we
report the results from $G^{S1} \sim G^{S3}$ only in T1 for brevity, because in Table 4.1 we observe that
CPL is rather stable over time, leading to similar plots in the rest.

The CPL pattern could potentially be applied to help operators design better
family plans. Because we have a model of the distribution of user behavior in forming
close-knit groups, we can propose better pricing strategies that charge users differently
according to the size of their social circles. For example, in most cases people only belong to one or two
cliques, which may be formed by their families or best friends. We can design specific billing
plans which are favorable to the communications among members of the same clique who are
also the customers of the same operator. Even if our friends are the customers of other operators,
we may still like to invite them to join us, because we know that it will be good for all of us. As
a result, this could implicitly improve the loyalty of the current users, and may further help to
increase the rate at which new customers sign up for the plans. Moreover, we can also reward a few
loyal users who span multiple social groups, because they might help to achieve a quick market
promotion by introducing new products and services to their friends.


Triangle Weight Law 

According to the clique definition, each node in a clique has connections with all the other nodes.
Although it is very intuitive that all these nodes are equivalent in topology, does this also mean that
they have equally close relationships? In our communication networks, the edge weight $w_{ij}$
gives the total number of contact times between i and j, which is an important indicator of
how intimately they relate to each other. Since a triangle is the base case of a clique,
given any triangle {i, j, k}, will $w_{ij}$, $w_{ik}$, and $w_{jk}$ hold approximately equal values because of
the structural equivalence between i, j, and k? Although this intuitive conjecture seems to make
sense, we have made very unexpected and striking discoveries in the real social networks, which
are described as follows.

Observation 4.1.3 (Triangle Weight Law (TWL)) For any triangle, let MaxWeight, MidWeight,
and MinWeight denote the maximum, medium, and the minimum edge weight respectively. In all
our graphs, they follow three power-laws:

$MaxWeight \propto MidWeight^{\alpha}$    (4.3)

$MaxWeight \propto MinWeight^{\beta}$    (4.4)

$MidWeight \propto MinWeight^{\gamma}$    (4.5)

where $\alpha$, $\beta$, and $\gamma$ are the power-law exponents which remain constant in weighted time-evolving
communication networks.
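The raw data behind the TWL is one (MinWeight, MidWeight, MaxWeight) triple per triangle. The sketch below shows one way to enumerate them; it assumes an undirected graph with comparable node ids, stored as a dictionary keyed by (i, j) with i < j. The function name and toy weights are hypothetical.

```python
from itertools import combinations

def triangle_weight_triples(weights):
    """weights maps (i, j) with i < j to the weight of undirected edge {i, j}.
    Returns one sorted (MinWeight, MidWeight, MaxWeight) tuple per triangle."""
    adj = {}
    for i, j in weights:
        adj.setdefault(i, set()).add(j)
        adj.setdefault(j, set()).add(i)
    triples = []
    for i in sorted(adj):
        higher = sorted(v for v in adj[i] if v > i)
        for j, k in combinations(higher, 2):
            if k in adj[j]:  # i < j < k, so each triangle is visited once
                ws = (weights[(i, j)], weights[(i, k)], weights[(j, k)])
                triples.append(tuple(sorted(ws)))
    return triples

# Toy weighted graph with two triangles: {1, 2, 3} and {2, 3, 4}.
toy_weights = {(1, 2): 10, (1, 3): 3, (2, 3): 1, (3, 4): 7, (2, 4): 2}
triples = triangle_weight_triples(toy_weights)
```

Plotting the three pairs of columns against each other in log-log scale gives the three scatter plots whose slopes are the TWL exponents.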

As a result, for the given triangle {i, j, k}, rather than being approximately equal, $w_{ij}$, $w_{ik}$ and
$w_{jk}$ are significantly different from each other. Figure 4.4 gives the results from the networks
$G^{S1} \sim G^{S3}$ in the same time period T1. To achieve a good fit, we bucketize the x-axis with
logarithmic binning [Newman, 2005], and for each bin, we compute the average value of y. The
dotted line marks where the value of x equals the value of y.

Figure 4.4: Illustration of Triangle Weight Law (Observation 4.1.3). Minimum, medium, and
maximum weights in all 3 pairs are plotted in logarithmic scales. Least square fits
all have $R^2 > 0.95$ in $G^{S1}_{T1} \sim G^{S3}_{T1}$.

Moreover, Figure 4.5 shows the three exponents of TWL in $G^{S1} \sim G^{S3}$ from T1 to T5. Notice
that $\alpha$, $\beta$, and $\gamma$ of these graphs take values in the range [0.5, 0.7], [0.4, 0.6], and [0.7, 0.8], which
seem persistent and stable.

Figure 4.5: Persistence of Triangle Weight Law. Exponents $\alpha$, $\beta$, and $\gamma$ (red, blue, green) in
$G^{S1} \sim G^{S3}$ remain about constant from T1 to T5.

In practical situations, due to missing data we can only have partial network information to
analyze. For example, in Figure 4.6, given the weighted egocentric subgraph that link $e_{23}$ belongs
to, what can we say about the missing $w_{23}$? While link prediction tries to predict between
which unconnected nodes a link will form next, our problem here concerns how to estimate the
value of an edge weight, because we already know there is a link between nodes 2 and 3. We
formulate this problem as the weight prediction problem, which is not only important for filling in
the missing values, but also useful for discovering anomalous links: if the actual
value of $w_{23}$ is significantly different from the predicted value, it would be highly unusual.

Based on the above discussion, TWL can help us solve the weight prediction problem. Formally, given
$e_{ij} \in G$, let $A$ denote the set of all the edges (excluding $e_{ij}$ itself) of the triangles that $e_{ij}$ belongs to.
$\forall e_k \in A$, $w(e_k)$ denotes the weight of $e_k$. The minimum and maximum values of $w(e_k)$ are represented as
$A_{min}$ and $A_{max}$, respectively. On one hand, if $w_{ij} < A_{min}$ or $w_{ij} > A_{max}$, the numerical relationship
between $w_{ij}$ and the weights on the other two edges is determined, so we can use either equations (4.4) and (4.5)
or (4.3) and (4.4) to estimate $w_{ij}$ directly. On the other, if $w_{ij} \in [A_{min}, A_{max}]$, $w_{ij}$ might be the minimum in
one triangle, while it might be the maximum in another triangle. Thus, for $\forall e_k \in A$, we define
$\phi_{(e_{ij},e_k)}(x)$ to represent one of the three equations (4.3) ~ (4.5) based on the particular numerical
relationship that $e_{ij}$ and $e_k$ could hold. The return value of $\phi_{(e_{ij},e_k)}(x)$ is the estimated weight
for edge $e_k$ given the possible value $x$ of $w_{ij}$. Here, we assume that all edge weights are positive
integers. Let $w_{min}$ be the minimum estimated value of $w_{ij}$ when $w_{ij} < A_{min}$, and $w_{max}$ be the
maximum estimated value of $w_{ij}$ when $w_{ij} > A_{max}$. Then the optimal value of $w_{ij}$ is:

$$\hat{w}_{ij} = \arg\min_{x} \sum_{e_k \in A} \left( w(e_k) - \phi_{(e_{ij},e_k)}(x) \right) \qquad (4.6)$$

where $x \in [w_{min}, w_{max}]$. We evaluate this approach in $G_{S1} \sim G_{S3}$ by comparing $\hat{w}_{ij}$ with $w_{ij}$
for each edge in $T_1$, $T_3$ and $T_5$. Due to the persistence of TWL we set $\alpha = 1.5$, $\beta = 2.2$, and
$\gamma = 1.2$ for $G_{S1}$; $\alpha = 1.3$, $\beta = 1.7$, and $\gamma = 1.4$ for $G_{S2}$; $\alpha = 1.4$, $\beta = 2.1$, and $\gamma = 1.3$ for $G_{S3}$.
Let $e = |\hat{w}_{ij} - w_{ij}|$ denote the prediction error. The average prediction accuracy for $e = 0$
(the exact prediction) and $e = 1$ is around 0.21 and 0.32, respectively. One problem of this simple
method is that it cannot predict $w_{ij}$ if the edge $e_{ij}$ does not belong to any triangle. Solving this
problem and further improving the prediction accuracy are areas of future work.

Figure 4.6: Weight Prediction Problem. What can we say about $w_{23}$?

























4.2 Patterns in reciprocity


4.2.1 Motivation 

One of the important aspects in human relations is the reciprocity, a.k.a. mutuality. Reciprocity 
can be defined as the tendency towards forming mutual connections with one another by returning 
similar acts, such as email and phone calls. In a highly reciprocal relationship both parties share 
equal interest in keeping up their relationship, while in a relationship with low reciprocity, one 
person is much more active than the other. 

It is important to understand the factors that play a role in the formation of reciprocity as there
exists evidence that reciprocal relationships are highly probable to persist in the future [Cesar 
A. Hidalgo, 2008]. Also, [Nguyen et al., 2010] shows that reciprocity related behaviors provide 
good features for ranking and classification based methods for trust prediction. Reciprocity 
plays other important roles in social and communication networks. For example, if the network 
supports a propagation process, such as spreading of viruses in email networks or spreading of 
information and ideas in social networks, then the presence of mutual links clearly speeds up 
the propagation. Non-existence of reciprocal links can also reveal unwanted calls and emails in 
spam detection. 

Despite its importance, reciprocity has remained an under-explored dynamic in networks. Most
work in network science and social network analysis focuses on node level degree distributions [Broder
et al., 2000; Faloutsos et al., 1999; Leskovec et al., 2007b], communities [Nussbaum et al., 2010;
Satuluri and Parthasarathy, 2009; Tantipathananandh et al., 2007], and triadic relations, such
as clustering coefficients and triangle closures [Granovetter, 1973]. The study of dyadic
relations [Xiang et al., 2010] and the related bivariate distributions they introduce is, however,
mostly overlooked, and thus is the focus of this section.

The motivation behind our work is grouped into two topics: 

M1. Modeling bivariate distributions in real data

Two vital components of understanding data at hand are studying the simple distributions in it and
visualizing it [Yang et al., 2008]. The study of reciprocity introduces bivariate distributions, such
as the distribution $\Pr(w_{ij}, w_{ji})$ of edge weights on mutual edges, where the association between two
quantitative variables needs to be explored. A vast majority of existing work focuses on univariate
distributions in real data such as power-laws [Clauset et al., 2009], log-normals [Bi et al., 2001],
and most recently DPLNs [Seshadri et al., 2008]. The study of multivariate distributions in real
data, however, has received very limited attention.

In addition, visualization of multivariate data in 2D is hard and often misleading due to issues 
regarding over-plotting. More importantly, mere visualization does not provide a compact data 
representation as opposed to data modeling. Summarization via aggregate functions such as the 
average or the median loses a lot of information and is also not representative, especially for 
skewed distributions as found in real data. 
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Models, on the other hand, provide compact data representations by capturing the patterns in the 
data, and are ideal tools for applications such as data compression and anomaly detection. 

M2. A weighted approach to reciprocity 

Traditional work [Garlaschelli and Loffredo, 2004] usually studies reciprocity on directed,
unweighted networks as a global feature, which is quantified as the ratio of the number of mutual
links pointing in both directions to the total number of links. Defining reciprocity in such an
unweighted fashion, however, prevents understanding the degree of reciprocity between mutual
dyads. In a weighted network, even though two nodes might have mutual links between them, the
skewness and the magnitude of the weights associated with these links contain more
information about how much reciprocity is really there between these nodes. For example, in a phone
call network the reciprocity between a mutual dyad where the parties make 80%-20% of their
calls respectively is certainly different from that of a mutual dyad with a 50%-50% share of their
calls. In short, edge weights are crucial to study reciprocity as a property of each dyad rather than
as a global feature of the entire network and give more insight into the level of mutuality.

In this work, we analyze phone call and SMS records of 1.87 million mobile phone users from 
a large city collected over six months. The data consists of over half a billion phone calls and 
more than 60 million SMSs exchanged. Our contributions are: 

1. We observe similar bivariate distributions $\Pr(w_{ij}, w_{ji})$ of mutual edge weights in the
communication networks we study. We propose the Triple Power Law (3PL) function to model
this observed pattern and show that 3PL fits the real data with millions of points very well.
We statistically demonstrate that 3PL provides better fits than the well-known Bivariate
Pareto and Bivariate Yule distributions. We also use 3PL to spot anomalies, such as a pair
of users with low mutuality where one of the parties makes 99% of the calls during the
entire working hours, non-stop.

2. We use weighted measures of reciprocity in order to quantify the degree of reciprocal
relations and study the correlations between reciprocity and local topological features among
user pairs. Our results suggest that mutual users with larger local network overlap and
higher degree similarity exhibit greater reciprocity.

To the best of our knowledge, this is one of the few works on modeling bivariate distributions in
real data as well as one of the few works that take a weighted approach to study the strength of
reciprocity.


4.2.2 Data Description 

In this work, we study anonymous mobile communication records of millions of users collected 
over a period of six months, December 1, 2007 through May 31, 2008. The data set contains 
both phone call and SMS interactions. 

From the whole six months of activity, we build three networks, Call-N, Call-D, and SMS,
in which nodes represent users and directed edges represent phone call and SMS interactions
between these users. Call-N is a who-calls-whom network with edge weights denoting (1)
total number of phone calls, Call-D is the same who-calls-whom network with edge weights
denoting (2) total duration of phone calls (aggregated in minutes), and SMS is a who-texts-whom
network with edge weights denoting (3) total number of SMSs. Table 4.2 gives the data statistics.
Global unweighted reciprocity is r=0.84 for CALL, and r=0.24 for SMS.

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
Network     N        E         W_N       W_D (min.)    W_SMS
CALL        1.87M    49.50M    483.7M    915×10^6      -
CALL (m)    1.75M    41.84M    468.7M    885×10^6      -
SMS         1.87M    8.80M     -         -             60.5M
SMS (m)     0.58M    2.10M     -         -             46.6M

Table 4.2: Data statistics. The number of nodes N, the number of directed edges E, and the total
weight W in the mutual (m) and non-mutual CALL and SMS networks.

When only mutual edges are considered, notice that the SMS network shrinks considerably (with
only about one fourth of the edges E remaining), whereas the CALL networks remain almost intact.
The drop in the total weight W, on the other hand, is small for all three networks. This shows
that the majority of SMS interactions are lopsided; however, the lopsided interactions constitute
only a small fraction of the total volume.
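As an illustration of the global unweighted reciprocity quoted above (the ratio of directed links that are reciprocated to all directed links), a minimal sketch follows. The function name `global_reciprocity` and the edge-list input format are assumptions made for this example, not the tooling used in this work.

```python
def global_reciprocity(edges):
    """Fraction of directed links (u, v) whose reverse link (v, u) also exists.

    `edges` is an iterable of (source, target) pairs; duplicates and
    self-loops are ignored (an assumption of this sketch).
    """
    links = {(u, v) for u, v in edges if u != v}
    if not links:
        return 0.0
    mutual = sum(1 for (u, v) in links if (v, u) in links)
    return mutual / len(links)


# Toy example: a <-> b is mutual, a -> c and c -> d are not.
print(global_reciprocity([("a", "b"), ("b", "a"), ("a", "c"), ("c", "d")]))  # 0.5
```

The same ratio can be read off Table 4.2: 41.84M / 49.50M is about 0.85 for CALL and 2.10M / 8.80M is about 0.24 for SMS, in line with the reported r values.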


4.2.3 Patterns and Proposed 3PL Model 

Given a network of users with mutual, weighted edges between them, say Call-N, and given
two users i and j in the network, is there a relation between the number of calls i makes to j ($w_{ij}$)
and the number of calls j makes to i ($w_{ji}$)? In this section, we want to understand the association
between the weights on the reciprocal edges in human communication networks and study their
distribution $\Pr(w_{ij}, w_{ji})$ across mutual dyads. Since we study the pair-wise joint distribution,
the order of the weights does not matter. Thus, to ease notation, we will denote the smaller of
these weights as $n_{ST}$ (for weight from Silent-to-Talkative) and the larger as $n_{TS}$, and will study
$\Pr(n_{ST}, n_{TS})$.
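The reduction from directed weights to unordered pairs $(n_{ST}, n_{TS})$ can be sketched as below. The dictionary-based input and the name `mutual_weight_pairs` are hypothetical, and node identifiers are assumed to be mutually comparable so that each dyad is visited once.

```python
def mutual_weight_pairs(weights):
    """Return one (n_ST, n_TS) tuple per mutual dyad.

    `weights` maps a directed edge (i, j) to its weight w_ij.
    n_ST is the smaller and n_TS the larger of (w_ij, w_ji).
    """
    pairs = []
    for (i, j), w_ij in weights.items():
        # Visit each unordered dyad once, and only if both directions exist.
        if i < j and (j, i) in weights:
            w_ji = weights[(j, i)]
            pairs.append((min(w_ij, w_ji), max(w_ij, w_ji)))
    return pairs


toy = {(1, 2): 5, (2, 1): 3, (1, 3): 7, (2, 3): 1, (3, 2): 1}
print(sorted(mutual_weight_pairs(toy)))  # [(1, 1), (3, 5)]
```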

Figure 4.7 (top-row) shows the weights $n_{TS}$ versus $n_{ST}$ for all the reciprocal edges in (from left
to right) Call-N, Call-D, and SMS. Each dot in the plots corresponds to a pair of mutual
edges. Since there could be several pairs with the same ($n_{ST}$, $n_{TS}$) weights, the regular scatter
plot of the reciprocal edge weights would result in over-plotting. Therefore, in order to make the
densities of the regions clear, we show the heatmap of the scatter plots where colors represent
the magnitude of volume (red means high volume and blue means low volume).

In Figure 4.7, we observe that most of the points are concentrated (1) around the origin and (2)
along the diagonal for all three networks. Concentration around the origin, for example in Call-N,
suggests that the vast majority of people make only a few phone calls with $n_{ST}, n_{TS} < 10$, and
much fewer people make many phone calls, which points to skewness. In addition, concentration
along the diagonal indicates that mutual people call each other mostly in a balanced fashion with
$n_{ST} \approx n_{TS}$. Notice that similar arguments hold for Call-D and SMS.

Even though heatmaps reveal similar patterns in all three networks, mere visualization does
not provide compact representations for our data. One way to get around this issue is to do
data summarization. For example, Figure 4.7 (bottom-row) shows how $n_{TS}$ changes with $n_{ST}$ on
average. The least square fit of the data points in log-log scales then provides a mathematical
representation of the data. Data summarization by means of an aggregate function such as the
average, however, loses a lot of information about the actual distribution. For instance in our
example, the slope of the least square fit in Call-N is close to 1, which suggests that $n_{TS}$ is
equal to $n_{ST}$ on average, and does not provide any information about the deviations. This issue
arises mostly because aggregation by the average is not a good representative, especially for
skewed distributions.

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(a) Call-N   (b) Call-D   (c) SMS

Figure 4.7: (top-row) Scatter plot heatmaps: total weight $n_{ST}$ (Silent to Talkative) vs the
reverse, $n_{TS}$, in log scales. Visualization by scatter plots suffers from over-plotting.
Heatmaps color-code dense regions but do not have compact representations or
formulas. Figures are best viewed in color; red points represent denser regions. The
counts are in $\log_2$ scale. (bottom-row) Aggregation by average: summarization and
data aggregation, e.g. averaging, loses a lot of information.

To further inspect the correlation between the mutual edge weights, we show the distribution of
weight ratios in Figure 4.8 for (a) Call-N, (b) Call-D, and (c) SMS. We observe that the
distributions follow “layers” of power-laws in all three plots. By fitting least square lines to the
top three so-called layers of points in log-log scales, we notice that the power-law fits have similar
exponents with shifted intercepts: many 1, 2, 3, ... ratios; fewer 1.5, 2.5, 3.5, ...; even fewer
1.33, 1.66, 2.33, ...; and so on. This confirms our visual observation in Figure 4.7 that a similar
pattern in the distribution of mutual edge weights holds for all three human communication
networks we study.
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Figure 4.8: Distribution of the ratio of weights on reciprocal edges follows “layers” of 
power-laws with similar exponents for all three types of weights (a) number of
phone-calls $W_N$, (b) duration of phone-calls $W_D$, and (c) number of SMSs $W_{SMS}$.


Given our observation that the distribution of reciprocal edge weights ($n_{ST}$, $n_{TS}$) follows a
similar pattern across all three networks, how can we model the observed distributions? Since neither
visualization nor aggregation qualifies as a compact data representation, we propose to formulate
the distributions with the following bivariate functional form $\Pr(n_{ST}, n_{TS})$, which we call the
Triple Power-Law (3PL) function.

Proposed Model 1 (Triple Power-Law (3PL)) In human communication networks, the
distribution $\Pr(n_{ST}, n_{TS})$ of mutual edge weights $n_{ST}$ and $n_{TS}$ ($n_{ST}$ being the smaller) follows a
Triple Power-Law in the following form

$$\Pr(n_{ST}, n_{TS}; \alpha, \beta, \gamma) \propto \frac{n_{ST}^{-\alpha}\, n_{TS}^{-\beta}\, (n_{TS} - n_{ST} + 1)^{-\gamma}}{Z(\alpha, \beta, \gamma)}, \quad \alpha > 0,\ \beta > 0,\ \gamma > 0, \text{ and } n_{TS} \ge n_{ST} > 0,$$

$$Z(\alpha, \beta, \gamma) = \sum_{n_{ST}=1}^{M} \sum_{n_{TS}=n_{ST}}^{M} n_{ST}^{-\alpha}\, n_{TS}^{-\beta}\, (n_{TS} - n_{ST} + 1)^{-\gamma},$$

where Z is the normalization constant and M is a very large integer.
Next we elaborate on the intuition behind the exponents $\alpha$, $\beta$ and $\gamma$.
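To make the functional form concrete, the sketch below evaluates the 3PL mass and its normalization constant by direct summation. It assumes negative exponents on all three factors and a small truncation point M chosen only to keep the example fast; the function names are hypothetical.

```python
def threepl_weight(n_st, n_ts, alpha, beta, gamma):
    """Unnormalized 3PL mass for a mutual pair with weights n_st <= n_ts."""
    if not 1 <= n_st <= n_ts:
        raise ValueError("expected 1 <= n_st <= n_ts")
    return n_st ** (-alpha) * n_ts ** (-beta) * (n_ts - n_st + 1) ** (-gamma)


def threepl_normalizer(alpha, beta, gamma, M):
    """Z(alpha, beta, gamma): sum of the mass over 1 <= n_st <= n_ts <= M."""
    return sum(
        threepl_weight(s, t, alpha, beta, gamma)
        for s in range(1, M + 1)
        for t in range(s, M + 1)
    )


def threepl_pmf(n_st, n_ts, alpha, beta, gamma, M):
    """Normalized 3PL probability of the pair (n_st, n_ts), truncated at M."""
    z = threepl_normalizer(alpha, beta, gamma, M)
    return threepl_weight(n_st, n_ts, alpha, beta, gamma) / z


# Balanced pairs are more probable than lopsided ones when gamma > 0.
balanced = threepl_pmf(3, 3, alpha=1e-6, beta=1.8, gamma=0.8, M=30)
lopsided = threepl_pmf(3, 10, alpha=1e-6, beta=1.8, gamma=0.8, M=30)
print(balanced > lopsided)  # True
```

The double sum costs on the order of M² operations, so for a very large M one would cache Z per parameter setting rather than recompute it for every pair.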


Intuition behind the $\beta$ exponent: 3PL is the 2D extension of the “rich-get-richer”
phenomenon; people who make many phone calls will continue making even more, and even longer
ones, leading to skewed, power-law-like distributions. The $\beta$ exponent is the skewness of the
main component, the number $n_{TS}$ of phone-calls from 'talkative' to 'silent'. High $\beta$ means a more
skewed distribution; $\beta = 0$ is a roughly uniform distribution. As we show in Figure 4.7, there are
many people who make only a few (and short) phone calls and only a few people who make many
(and long) phone calls. Visually, the vast majority of people who make only a few phone calls are
represented with the high density (dark red) regions around the origin in all three networks.

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Intuition behind the $\alpha$ exponent: Similarly, this indicates the skewness for $n_{ST}$, the number
of silent-to-talkative phone-calls. A high value of $\alpha$ means high skewness, while $\alpha$ close to zero
means uniformity. Notice that $\alpha \approx 0$ for our real phone-call datasets (see Figure 4.9).


Intuition behind the $\gamma$ exponent: It captures the skewness in asymmetry. High $\gamma$ means that
large asymmetries are improbable. This is the case in all our real datasets. For example, in
addition to the origin in Figure 4.7(a), the regions along the diagonal also have high densities. These
regions correspond to mutual pairs with about equal interaction in both directions. This suggests
that humans tend to reciprocate their communications. 3PL also captures this observation; notice
that the probability is higher for $n_{TS}$ close to $n_{ST}$ and drops for larger inequality ($n_{TS} - n_{ST}$) as
a power-law with exponent $\gamma$.


4.2.4 Comparison of 3PL to Competing Models 

In order to strengthen our case for the 3PL, we would like to rule out several plausible competing 
distributions if possible. We cannot, however, compare the 3PL fit of our data with fits to every 
competing distribution, of which there are an infinite number. The reason is, it will always 
be possible to find a distribution that fits the data better than the 3PL if we define a family of 
functions with a sufficiently large number of parameters. 

In this section, we compare our model with two well-known parametric distributions for skewed
bivariate data from statistics, the Bivariate Pareto [Kotz et al., 2000] and the Bivariate Yule [Xekalaki,
1986]. Their functional forms are given as two alternative competitor models as follows.

Competitor Model 1 (Bivariate Pareto)

$$f_{X_1,X_2}(x_1, x_2) = k(k+1)(ab)^{k+1}(a x_1 + b x_2 + ab)^{-(k+2)}, \quad x_1, x_2, a, b, k > 0.$$

Competitor Model 2 (Bivariate Yule)

$$f_{X_1,X_2}(x_1, x_2) = \frac{\rho_{(2)}\,(x_1 + x_2)!}{(\rho + 1)_{(x_1 + x_2 + 2)}}, \quad x_1, x_2, \rho > 0; \quad \alpha_{(\beta)} = \Gamma(\alpha + \beta)/\Gamma(\alpha),\ \alpha > 0,\ \beta \in \mathbb{R}.$$

We use maximum likelihood estimation to fit the parameters of each model for each of our three 
networks. In Figure 4.9, we report the best-fit parameters as well as the corresponding data log 
likelihood scores (the higher, the better). Notice that for Call-N and Call-D the 3PL achieves 
higher data likelihood than both Bivariate Pareto and Bivariate Yule. On the other hand, for 
SMS, the data likelihood scores of all three models are about the same; with Bivariate Pareto 
giving a slightly higher score. 
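As a rough illustration of maximum likelihood fitting for 3PL, the sketch below scores parameter triples by data log-likelihood and picks the best one from a coarse grid. The grid search, the small truncation M, and the function names are assumptions chosen for brevity; the fits reported in Figure 4.9 would need a proper numerical optimizer on the full data.

```python
import math
from itertools import product


def threepl_loglik(pairs, alpha, beta, gamma, M):
    """Log-likelihood of (n_ST, n_TS) pairs under 3PL truncated at M."""
    log_z = math.log(sum(
        s ** (-alpha) * t ** (-beta) * (t - s + 1) ** (-gamma)
        for s in range(1, M + 1)
        for t in range(s, M + 1)
    ))
    ll = 0.0
    for s, t in pairs:
        if not 1 <= s <= t <= M:
            raise ValueError("each pair must satisfy 1 <= n_ST <= n_TS <= M")
        ll += (-alpha * math.log(s) - beta * math.log(t)
               - gamma * math.log(t - s + 1) - log_z)
    return ll


def fit_threepl_grid(pairs, grid, M):
    """Return the (alpha, beta, gamma) triple from `grid` with highest likelihood."""
    best = max(product(grid, repeat=3),
               key=lambda p: threepl_loglik(pairs, p[0], p[1], p[2], M))
    return best, threepl_loglik(pairs, best[0], best[1], best[2], M)


# Toy data: mostly balanced, low-volume pairs.
toy_pairs = [(1, 1)] * 20 + [(2, 2)] * 5 + [(1, 3)]
params, loglik = fit_threepl_grid(toy_pairs, grid=[0.5, 1.0, 2.0], M=20)
print(params, loglik)
```

Higher log-likelihood means a better fit, which is how the scores of the three models are compared.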

The simple sign of the difference between the log likelihoods (log likelihood ratio $\mathcal{R}$), however,
does not on its own show conclusively that one distribution is better than the other as it is subject 
to statistical fluctuation. If its true value over many independent data sets drawn from the same 
distribution is close to zero, then the fluctuations can easily change its sign and thus the results 
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of the comparison cannot be trusted. In order to make a firm judgement in favor of 3PL, we need 
to show that the difference between the log likelihoods is sufficiently large and that it could not 
be the result of a chance fluctuation. To do so, we need to know the standard deviation $\sigma$ on $\mathcal{R}$,
which we estimate from our data using the method proposed in [Vuong, 1989]. 

In Figure 4.9, we report the normalized log likelihood ratio denoted by $z = \mathcal{R}/(\sqrt{2n}\,\sigma)$, where
n is the total number of data points (number of mutual edge pairs in our case). A positive
z value indicates that the 3PL model is truly favored over the alternative. We also show the
corresponding p-value, $p = \mathrm{erfc}(z)$, where erfc is the complementary Gaussian error function. It
gives an estimate of the probability that we measured a given value of $\mathcal{R}$ when the true value of
$\mathcal{R}$ is close to zero (and thus cannot be trusted). Therefore, a small p-value shows that the value
of $\mathcal{R}$ is unlikely to be a chance result and its sign can be trusted.
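The normalized log likelihood ratio and its p-value can be computed from per-point log-likelihoods as sketched below. Two assumptions are made here: $\sigma$ is estimated as the standard deviation of the point-wise log-likelihood differences, and the p-value uses the absolute value of z so that it stays in [0, 1] when the alternative model is favored. The function name is hypothetical.

```python
import math


def normalized_llr(loglik_a, loglik_b):
    """Return (z, p) comparing model A against model B.

    loglik_a[i] and loglik_b[i] are the log-likelihoods of data point i
    under each model. z > 0 favors A, z < 0 favors B.
    """
    n = len(loglik_a)
    if n == 0 or n != len(loglik_b):
        raise ValueError("need two equally long, non-empty sequences")
    diffs = [a - b for a, b in zip(loglik_a, loglik_b)]
    ratio = sum(diffs)                      # log likelihood ratio R
    mean = ratio / n
    sigma = math.sqrt(sum((d - mean) ** 2 for d in diffs) / n)
    if sigma == 0.0:
        raise ValueError("point-wise differences have zero variance")
    z = ratio / (math.sqrt(2 * n) * sigma)
    p = math.erfc(abs(z))                   # small p: the sign of R can be trusted
    return z, p
```

For example, `normalized_llr(ll_3pl, ll_pareto)` with positive z and small p would support 3PL over the Bivariate Pareto, where `ll_3pl` and `ll_pareto` are hypothetical lists of per-pair log-likelihoods.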



                          Call-N        Call-D        SMS
Triple Power Law (3PL)
  $\alpha$                1e-06         1e-06         0.8120
  $\beta$                 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   1.8670   1.5896
  $\gamma$                0.8204        0.9650        0.3005
  Loglikelihood           -7.55e+07     -8.88e+07     -5.41e+06
Bivariate Pareto
  k                       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   0.7657   0.7862
  a                       0.2119        0.5723        0.7097
  b                       10e+05        1.25e+04      0.7553
  Loglikelihood           -7.77e+07     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   -5.39e+06
  z                       803.73        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   -41.06
  p                       0             0             0
Bivariate Yule
  $\rho$                  1.11e-16      5.55e-17      1e-06
  Loglikelihood           -8.59e+07     -10.00e+07    -5.41e+06
  z                       2.14e+03      1.93e+03      1.49
  p                       0             0
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Figure 4.9: Maximum likelihood parameters estimated for 
3PL, Bivariate Pareto and the Bivariate Yule and data log-likelihoods obtained with the best-fit
parameters. We also give the normalized log likelihood ratios z and the corresponding p-values. A
positive (and large) z value indicates that 3PL is favored over the alternative. A small p-value
confirms the significance of the result. Notice that 3PL provides significantly better fits to CALL
and is as good as its competitors for SMS.


Notice that the magnitude of z for Call-N and Call-D is quite large, which makes the p-value zero and
shows that 3PL is a significantly better fit for those data sets. On the other hand, z is relatively much
smaller for SMS, therefore we conclude that 3PL provides as good a fit as its competitors for this data
set. Note that the number of mutual edge pairs n in SMS (≈ 1 million) is much smaller than that of
the call networks (≈ 21 million) (Table 4.2). It is worth emphasizing that difference, because the
bivariate pattern of reciprocity might reveal itself better in larger data sets, and it would be
interesting to see whether 3PL provides a better fit for SMS when more data samples become available.

In addition, we computed the Kolmogorov-Smirnov (KS) statistic, which is given as

$$\max_{x_1, x_2} \left| F_n(x_1, x_2) - F(x_1, x_2) \right|$$

in which $F_n$ is the empirical joint cumulative distribution function (CDF) of our data, and F is that
from an estimated model. Unlike in one dimension, there is not a closed form
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for its distribution under the null, but we can find it by simulation, (note that with bivariate 
data, the joint cumulative distribution function is $F(x_1, x_2) = \mathrm{Prob}(X_1 \le x_1, X_2 \le x_2)$). In
Table 4.3 we report the KS statistic of all three models for all of our three data sets. Notice that
3PL provides the smallest distance to the empirical distribution for all our call data sets under
the KS test.
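A brute-force version of the bivariate KS statistic can be sketched as follows. It is only an approximation under stated assumptions: the maximum is taken over the grid of observed coordinates rather than over all of the plane, and `model_cdf` is a hypothetical callable returning the model's joint CDF at a point.

```python
def ks_statistic_2d(data, model_cdf):
    """Largest absolute gap between the empirical and model joint CDFs.

    `data` is a list of (x1, x2) points; `model_cdf(x1, x2)` returns
    Prob(X1 <= x1, X2 <= x2) under the fitted model.
    """
    n = len(data)
    if n == 0:
        raise ValueError("data must be non-empty")
    xs = sorted({x for x, _ in data})
    ys = sorted({y for _, y in data})
    worst = 0.0
    for x in xs:
        for y in ys:
            # Empirical joint CDF: share of points dominated by (x, y).
            empirical = sum(1 for a, b in data if a <= x and b <= y) / n
            worst = max(worst, abs(empirical - model_cdf(x, y)))
    return worst
```

This direct evaluation is quadratic in the number of distinct coordinates and linear in n per grid point, so it suits small samples or pre-aggregated counts; millions of pairs would need cumulative sums over a 2D histogram instead.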



                    Call-N    Call-D    SMS
Triple Power Law    0.0189    0.0810    0.0716
Bivariate Pareto    0.720     0.1147    0.0113
Bivariate Yule      0.2605    0.1661    0.0818

Table 4.3: Kolmogorov-Smirnov statistic; the maximum absolute difference between the joint
empirical CDF computed from our data and the joint CDF estimated from three different
models. The smaller values indicate a better fit to real data.


Next, we would like to demonstrate also visually that 3PL provides a better fit to the real data 
than its competitors. To this end, having estimated the model parameters for all three models, 
we generated synthetic data sets with the same number of samples as in each of our networks. 
We show the corresponding plots for Call-N in Figure 4.10 (a) for real data, and synthetic data 
generated by (b) 3PL, (c) Bivariate Pareto, and (d) Bivariate Yule. We notice that the simulated 
data distribution from 3PL looks more realistic than its two competitors. Similar results for 
Call-D and SMS are omitted for brevity. 




(a) Call-N (real data) 


(b) 3PL (synt.) 


(c) Biv. Pareto (synt.) (d) Biv. Yule (synt.) 


Figure 4.10: Contour-maps for the scatter plot $n_{TS}$ versus $n_{ST}$ in Call-N (a) for real data, and
synthetic data simulated from (b) 3PL, (c) Bivariate Pareto and (d) Bivariate Yule
functions using the best-fit parameters. Notice that synthetic data generated by 3PL
also looks visually more similar to the real data than its competitors. Counts are in
$\log_2$ scale. Figures are best viewed in color.


Goodness of Fit 

The likelihood ratio test is used to compare two models to determine which one provides a
better fit to given data. However, as we mentioned in the previous section, it cannot directly
show when both competing models are poor fits to the data; it can only tell which is the least
bad. Therefore, in addition to showing that 3PL provides a better (or as good) fit than its two
competitors, we also need to demonstrate that it indeed provides a good fit itself.

(a) Call-N   (b) Call-D   (c) SMS

Figure 4.11: Distribution of $u_i = F(x_{1,i}, x_{2,i})$ for all data points i according to cumulative
distribution function (CDF) F estimated from our 3PL model. An approximately
uniform distribution of $u_i$ shows that 3PL provides a good fit to real data.

A general class of tests for goodness of fit work by transforming the data points $(x_{1,i}, x_{2,i})$
according to a cumulative distribution function (CDF) F as $u_i = F(x_{1,i}, x_{2,i})$ for $\forall i$, $1 \le i \le n$.
One can show that if F is the correct CDF for the data, $u_i$ should be uniformly distributed
(derivation follows from basic probability theory). That is, if the CDF F estimated from our model is
approximately correct, the empirical CDF of the $u_i = F(x_{1,i}, x_{2,i})$ should be approximately a
straight line from (0, 0) to (1, 1).

For each of our three data sets, we generate synthetic data drawn from our 3PL function with
the corresponding estimated best-fit parameters. Then, we compute $u_i = F(x_{1,i}, x_{2,i})$ for all the
data points in each of the data sets, where F is the CDF estimated from the corresponding synthetic data.
In Figure 4.11, we show the CDF of $u_i$ as well as the CDF for a perfect uniform distribution.
Notice that the distribution of $u_i$ is almost uniform for Call-N and Call-D, and quite close to
the uniform for SMS. This corroborates our case that our model provides a good approximation to
the correct CDF of our data sets, and thus indeed provides a good fit.


4.2.5 3PL at Work 

There exist at least three levels at which we can make use of parametric statistical models for 
real data: (1) as data summary: compact mathematical representation, data reduction; (2) as 
simulators: generative tools for synthetic data; (3) in anomaly detection: probability density 
estimation. 

In Figure 4.12 (a), we show the top 100 pairs in Call-D with the lowest 3PL likelihood (marked with
triangles). Figure 4.12 (b) shows the local neighborhood of one of the pairs, say A and B (marked
with circles in (a)). We notice low mutuality; A initiated 99% of the calls, in return for less than
2 hours total duration of calls B made. Further inspection revealed constant daily activity by A,
including weekends, with about 7 hours call duration per day on average, starting at around 9am
and lasting until around 5-8pm in the evening. It is also surprising that all these calls are
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addressed to the same contact, B. While for privacy reasons, we cannot fully tell the scenario 
behind this behavior, this proves to be an interesting case for the service operator to further look 
into. Other interesting anomalous observations are omitted for brevity. 
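Ranking pairs by their 3PL likelihood, as done to pick the least likely 100 points, can be sketched as follows. The normalization constant Z is the same for every pair and is therefore dropped, since it does not change the ordering; the function name and the toy parameter values are assumptions for this example.

```python
import math


def least_likely_pairs(pairs, alpha, beta, gamma, k=100):
    """Return the k pairs (n_ST, n_TS) with the lowest 3PL log-likelihood."""
    def log_mass(pair):
        n_st, n_ts = pair
        # Unnormalized log of n_ST^-alpha * n_TS^-beta * (n_TS - n_ST + 1)^-gamma.
        return (-alpha * math.log(n_st) - beta * math.log(n_ts)
                - gamma * math.log(n_ts - n_st + 1))
    return sorted(pairs, key=log_mass)[:k]


toy_pairs = [(1, 1), (2, 3), (1, 500), (40, 45)]
print(least_likely_pairs(toy_pairs, alpha=1e-6, beta=1.8, gamma=0.8, k=2))
# [(1, 500), (40, 45)]
```

In the toy example the highly lopsided pair (1, 500) comes out as the least likely, which mirrors the low-mutuality anomaly discussed above.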


In Call-N, one of the outliers we found is a mutual pair of users, say A and B, where A initiated more
than 2600 calls in reply to only 45 calls. As it looks suspicious, we examined the spread of these
calls over the course of six months for a sanity check. To our surprise, we found out that for two
consecutive months, A called B 40 to 60 times every single day, constantly. About 90% of these calls
lasted less than 10 minutes, and more than 50% took less than 2 minutes. In addition, all the calls
lasted strictly less than 1 hour (see Figure 4.13). This shows that a vast majority of the many calls
between this pair were quite short.




Figure 4.12: (a) Least likely 100 points by 3PL (shown with triangles), (b) Local neighborhood
of one mutual pair detected as an outlier (marked with circles). Edge thickness is
proportional to edge weight.


One scenario for this situation could be that there exists a specific package from the mobile service
offering a constant rate for all calls that take 10 minutes or less. Thus, for longer calls the
parties had to hang up and initiate another call, which would result in an excessive number of phone
calls as observed. Another scenario could be the case of faulty equipment dropping calls quite
frequently and thus making the parties redial. In any case, we believe that the high non-reciprocity
as well as the large number of calls initiated presents an interesting case for the service operator
to further look into.



duration (30 sec. intervals) 

Figure 4.13: Distribution of the duration of over 2600 
calls detected as an anomaly in Call-N. 
Notice that a majority of the calls took less 
than 10 minutes. 


4.2.6 Reciprocity and Local Network Topology 

Given that person i calls person j $w_{ij}$ times and person j calls person i $w_{ji}$ times, what is the
degree of reciprocity between them? In this section, we discuss several weighted metrics that
quantify reciprocity between a given mutual pair. Later, we study the relationship between
reciprocity among mutual pairs and their topological similarity.
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Figure 4.14: Spread of calls by days for the anomaly detected in Call-D. Notice that the calls 
are made during the entire duration of working hours. 


Weighted reciprocity metrics 


Three metrics we considered in this work to quantify the “similarity” or “balance” of weights $w_{ij}$
and $w_{ji}$ are (1) Ratio $r = \frac{\min(w_{ij}, w_{ji})}{\max(w_{ij}, w_{ji})} \in [0, 1]$, (2) Coherence
$c = \frac{2\sqrt{w_{ij} w_{ji}}}{w_{ij} + w_{ji}} \in [0, 1]$ (geometric mean divided by the arithmetic mean of the edge
weights), and (3) Entropy $e = -p_{ij} \log_2(p_{ij}) - p_{ji} \log_2(p_{ji}) \in [0, 1]$, where
$p_{ij} = \frac{w_{ij}}{w_{ij} + w_{ji}}$ and $p_{ji} = 1 - p_{ij}$. All these metrics are equal to 0 for
the (non-mutual) pairs where one of the edge weights is 0, and equal to 1 when the edge weights
are equal. Although these metrics are good at capturing the balance of the edge weights, they
fail to capture the volume of the weights. For example, a human would score ($w_{ij}$=100, $w_{ji}$=100)
higher than ($w_{ij}$=1, $w_{ji}$=1), whereas all the metrics above would treat them as equal.


Therefore, we propose to multiply these metrics by the logarithm of the total weight, such that
the reciprocity score consists of both a “balance” as well as a “volume” term. In the rest of this
section, we use the weighted ratio $r_w = \frac{\min(w_{ij}, w_{ji})}{\max(w_{ij}, w_{ji})} \log(w_{ij} + w_{ji})$ as the reciprocity measure in
our experiments. The results are similar for the other weighted metrics, $c_w$ and $e_w$.
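The three balance metrics and the weighted ratio can be written down directly, as in the sketch below. The function names are hypothetical, and the natural logarithm is assumed for the volume term because the base is not stated.

```python
import math


def ratio(w_ij, w_ji):
    """Smaller weight divided by larger weight, in [0, 1]."""
    hi = max(w_ij, w_ji)
    return min(w_ij, w_ji) / hi if hi > 0 else 0.0


def coherence(w_ij, w_ji):
    """Geometric mean divided by arithmetic mean of the two weights."""
    total = w_ij + w_ji
    return 2 * math.sqrt(w_ij * w_ji) / total if total > 0 else 0.0


def entropy(w_ij, w_ji):
    """Binary entropy (base 2) of the share of weight in each direction."""
    total = w_ij + w_ji
    if total == 0:
        return 0.0
    e = 0.0
    for w in (w_ij, w_ji):
        p = w / total
        if p > 0:  # the term for p = 0 is taken as 0
            e -= p * math.log2(p)
    return e


def weighted_ratio(w_ij, w_ji):
    """Balance term (ratio) times volume term (log of total weight)."""
    total = w_ij + w_ji
    return ratio(w_ij, w_ji) * math.log(total) if total > 0 else 0.0


# Equal balance, different volume: only the weighted score tells them apart.
print(ratio(100, 100) == ratio(1, 1))                    # True
print(weighted_ratio(100, 100) > weighted_ratio(1, 1))   # True
```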


Reciprocity and Network Overlap 

Here, we want to understand whether there is a relation between the local network overlap (local 
density) and reciprocity between mutual pairs. Local network overlap of two nodes is simply the 
number of common neighbors they have in the network. 
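Counting common neighbors is straightforward once adjacency sets are available, as in the sketch below. Treating the directed edges as undirected when building neighborhoods is an assumption of this example, as are the function names.

```python
def build_adjacency(edges):
    """Map each node to the set of its neighbors, ignoring edge direction."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # skip self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def common_neighbors(adj, i, j):
    """Local network overlap: number of neighbors shared by nodes i and j."""
    shared = adj.get(i, set()) & adj.get(j, set())
    return len(shared - {i, j})


toy_edges = [(1, 2), (1, 3), (2, 3), (1, 4), (2, 4), (2, 5)]
adj = build_adjacency(toy_edges)
print(common_neighbors(adj, 1, 2))  # 2 (nodes 3 and 4)
```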

In Figure 4.15, we show the cumulative distribution of reciprocity separately for different ranges 
of overlap. The figures suggest that people with more common contacts tend to exhibit higher 
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reciprocity, both in their SMS and phone call interactions. 



(a) Call-N   (b) Call-D   (c) SMS
[Plot residue: x-axis is reciprocity; one curve per range of the number of common neighbors (#CN): 0 < #CN <= 10, 10 < #CN <= 20, 20 < #CN <= 30, 30 < #CN <= 50, and 50 < #CN.]

Figure 4.15: Complementary cumulative distribution of reciprocity for different ranges of local 
network overlap (number of Common Neighbors). Notice that the more the number 
of common contacts, the higher the reciprocity. 


Reciprocity and Degree Similarity 


Next, we investigate the relation between the degree similarity (degree assortativity) and
reciprocity. In Figure 4.16, we show the heatmap for the average reciprocity among pairs with
respective degrees $d_i$ and $d_j$ for Call-N (similar figures for other networks are omitted for
brevity). The heatmap suggests that two people with more similar numbers of contacts exhibit
larger reciprocity; notice the increase in reciprocity with increasing $d_j$ for fixed $d_i$ (from
bottom to diagonal, towards degree similarity) and then the drop from the diagonal to the right,
towards degree dissimilarity.



Figure 4.16: Average reciprocity among dyads with degrees ($d_i$, $d_j$) in Call-N. Notice that
reciprocity is higher among pairs with similar degrees (number of contacts). Red color represents
higher average reciprocity. Figure is best viewed in color.


Discussion 

Our observations suggest that people who have many common contacts as well as similar number 
of contacts exhibit higher reciprocity, i.e., they play equally active roles in maintaining their 
relationship. In this discussion, we put these findings into context and relate them to the network 
topology at large. 

Assume that our networks consist of hubs and communities. Communities would represent
tightly connected groups of people (e.g., social circle of friends), and hubs would represent the
“connector” nodes that are linked from several different communities (e.g., call centers, pizza
shops, etc.). Intuitively, the reciprocity is higher among friends in the same social circle, that is
within communities, and much lower between a hub node and its neighbors.

We believe that our results support this hubs-and-communities structure assumption. The reason
is that communities are clique-like structures where the nodes in the clique have about the same
degree and share many common neighbors. In contrast, hubs exhibit star-like structures; they
share far fewer common neighbors and have much lower degree similarity with their neighbors.
Therefore, the existence of a hubs-and-communities structure in our networks is a sufficient
condition that would give us our observed results.

We reached these conclusions by studying dyadic relations only. It would be interesting to conduct
a similar study at the communities level and see how reciprocity changes between different
communities. We conjecture that similar results would hold, due to the
communities-within-communities structure also observed in real networks [Chakrabarti et al.,
2004b; Girvan and Newman, 2002].


4.3 Patterns in call durations 

4.3.1 Motivation 

In the study of phone call databases [Onnela et al., 2007a; Seshadri et al., 2008; Willkomm et al.,
2008], a common technique to ease the analysis of the data is the summarization of the phone
call records into aggregated attributes [Hill et al., 2007], such as the aggregate call duration or
the total number of phone calls. By doing that, the size of the database can be reduced by orders
of magnitude, allowing the execution of most well-known data mining algorithms in a feasible
time. However, we believe that such a representation veils relevant temporal information inherent
in a user or in a relationship between two people. When all the information about the phone call
records of a user is aggregated into single summarized attributes, we no longer know how
often this user calls or for how long he talks per phone call. One may suggest, for instance,
using descriptive statistics such as mean and variance to describe the duration of the user’s phone
calls, but it is well known that the distribution of these values is highly skewed [Willkomm et al.,
2008], which invalidates the use of such statistics.

In this work, we tackle the following problem. Given a very large amount of phone records, what
is the best way to summarize the calling behavior of a user? In order to answer this question,
we examine phone call records obtained from the network of a large mobile operator of a large
city. More specifically, we analyze the duration of hundreds of millions of calls and we propose the
Truncated Lazy Contractor (TLAC) model to describe the durations of the phone
calls of a single user. Thus, the TLAC models the Calls Duration Distribution (CDD) of a user
and is parsimonious, having only two parameters, the efficiency coefficient $\rho$ and the weakness
coefficient $\beta$. We show that the TLAC model was the best alternative to model the CDD of the
users of our dataset, mainly because it has a heavier tail and head than the log-normal distribution,
which is the most commonly used distribution to model CDDs [Guo et al., 2007].

We also suggest the use of the TLAC parameters as a better way to summarize the call duration
behavior of a user. We propose the MetaDist to model the population of users that have a given
call duration behavior. The MetaDist is the meta-distribution of the $\rho_i$ and $\beta_i$ parameters
of each user’s CDD and, when its isocontours are visualized, its shape is surprisingly similar
to a bivariate Gaussian distribution. This fascinating regularity, observed in significantly
noisy data, makes the MetaDist a potential distribution to be explored in the direction of better
understanding the call behavior of mobile users.

In summary, the main contributions of this work are: 

• The proposal of the TLAC model to represent the individual phone call durations of mobile
customers;

• The MetaDist to model the group call behavior of the mobile phone users;

• The use of the MetaDist and the Focal Point to describe the collective temporal evolution
of large groups of customers.

As an additional contribution, we show the usefulness of the TLAC model. We show that it
can spot anomalies and it can succinctly verify correlations (or lack thereof) between the TLAC
parameters of the users and their total number of phone calls, aggregate duration and distinct
patterns. We also emphasize that the TLAC model can be used to generate synthetic datasets and
to significantly summarize a very large number of phone call records.


4.3.2 Patterns and Proposed TLAC Model 

In this work, we analyze mobile phone records of millions of mobile phone users in our CALL
dataset (see Table 4.2), spanning four months. In this period, about half a billion phone calls were
registered and, for each phone call, we have information about the duration of the phone call, the
date and time it occurred and encrypted values that represent the source and the destination of
the call, which may be mobile or not. Unless stated otherwise, the results shown in this work
refer to the phone call records of the first month of our dataset. The results for the other 3 months
are explicitly mentioned in the next section.


Problem Definition 

The Call Duration Distribution (CDD) is the distribution of the call duration per user in a period
of time, which in our case is one month. In the literature, there is no consensus about which
well-known distribution should be used to model the CDD. Some researchers claim that the
CDD should be modeled by a log-normal distribution [Guo et al., 2007] and others that it should
be modeled by the exponential distribution [Tejinder S. Randhawa, 2003]. Thus, in this section,
we tackle the following problem:

Problem 1 CDD FITTING. Given $d_1, d_2, \ldots, d_{n_i}$, the durations of $n_i$ phone calls made by a
user $i$ in a month, find the most suitable distribution for them and report its parameters.

As we mentioned before, there is no consensus about which well-known distribution should be
used to model the CDD, i.e., for some cases the log-normal fits well and for others, the
exponential is the most appropriate distribution. Thus, finding yet another specific distribution
that could provide good fits to a particular group of CDDs would just add another variable to
Problem 1, without solving it. Therefore, we propose that the distribution that solves Problem 1
should necessarily satisfy the following requirements:

• R1: Intuitively explain the intrinsic reasons behind the call durations;

• R2: Provide good, reliable fits for the great majority of the users.

In the following sections, we present a solution for Problem 1 and tackle Requirement R1 by
presenting the TLAC model, which is an intuitive model to represent CDDs. Then, we tackle
Requirement R2 by showing the goodness of fit of the TLAC model for our dataset.


TLAC Model 

Given these constraints, we start solving Problem 1 by explaining the evolution of the call
durations from a survival analysis perspective. We consider that all the calls $c_1, c_2, \ldots, c_C$
made by a user in a month are individuals which are alive while they are active. When a phone
call $c_j$ starts, its initial lifetime is $l_j = 1$ and, as time goes by, $l_j$ progressively increments
until the call is over. The final lifetime of every $c_j$ is then its duration $d_j$.

In the survival analysis literature, an interesting survival model that can intuitively explain the
lifetime, i.e., duration, of the phone calls is the log-logistic distribution. Besides its use
in survival analysis [Bennett, 1983; Lawless and Lawless, 1982; Mahmood, 2000], there are
examples in the literature of the use of the log-logistic distribution to model the distribution
of wealth [Fisk, 1961], flood frequency analysis [M.I. Ahmad and Werritty, 1988] and software
reliability [Gokhale and Trivedi, 1998]. All of these examples present a modified version of the
well-known “rich get richer” phenomenon. First, for a variable to be “rich”, it has to face several
risks of “dying” but, if it survives, it is more likely to get “richer” at every time. We propose that
the same occurs for phone call durations. After the initial risks of hanging up the call, e.g.,
wrong number calls, voice mail calls and short message calls such as “I am busy, talk to you
later” or “I am here. Where are you?” type of calls, the call tends to get longer at every time.
As an example, the lung cancer survival analysis case [Bennett, 1983] parallels our environment
if we substitute endurance to disease with propensity to talk: a patient/customer that has stayed
alive/talking so far will remain so for more time, i.e., the longer the duration of the call so
far, the more the parties are enjoying the conversation and the longer the call will survive.

Thus, to solve Problem 1, we propose the Truncated Lazy Contractor (TLAC) model, which
is a truncated version of the log-logistic distribution, since it does not contain the interval [0,1).
First, we show in Figure 4.17-a the Probability Density Function (PDF) of the TLAC, the
log-normal and exponential distributions, in order to emphasize the main differences between
these models. The parameters were chosen according to a median call duration of 2 minutes
for all distributions. The TLAC and log-normal distributions are very similar, but the TLAC
is less concentrated around the median than the log-normal, i.e., it has power law increase ratios
in its head and in its tail. We believe that this is another indication that the TLAC is suitable to
model the users’ CDD, since, as verified by [Willkomm et al., 2008], CDDs have “semi-heavy”
tails. The basic formulas for the log-logistic distribution and, consequently, for the TLAC,
are [Lawless and Lawless, 1982]:


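$$\mathrm{PDF}_{TLAC}(x) = \frac{\exp(z(1 - \sigma) - \mu)}{\sigma\,(1 + e^{z})^{2}}$$

$$\mathrm{CDF}_{TLAC}(x) = \frac{1}{1 + \exp\left(-\frac{\ln(x) - \mu}{\sigma}\right)}$$

$$z = (\ln(x) - \mu)/\sigma$$

where $\mu$ is the location parameter and $\sigma$ the shape parameter.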
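
The log-logistic formulas can be sketched in Python as follows. This is a minimal sketch with
hypothetical function names: the PDF is written as the derivative of the CDF given above, and the
truncation to durations of at least 1 second is not renormalized here.

```python
import math

def tlac_cdf(x, mu, sigma):
    """Log-logistic CDF with location mu and shape sigma, for x > 0."""
    z = (math.log(x) - mu) / sigma
    return 1.0 / (1.0 + math.exp(-z))

def tlac_pdf(x, mu, sigma):
    """Derivative of tlac_cdf with respect to x."""
    z = (math.log(x) - mu) / sigma
    return math.exp(z) / (sigma * x * (1.0 + math.exp(z)) ** 2)

def odds_ratio(x, mu, sigma):
    """CDF / (1 - CDF); equals exp(z) for the log-logistic."""
    c = tlac_cdf(x, mu, sigma)
    return c / (1.0 - c)

# Illustrative parameters: median of 2 minutes (120 s), arbitrary shape.
mu, sigma = math.log(120.0), 0.6
print(tlac_cdf(120.0, mu, sigma))  # 0.5 at the median
```

Since the odds ratio reduces to exp(z), its logarithm is linear in ln(x) with slope 1/sigma and
intercept -mu/sigma.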




Figure 4.17: Comparison of shapes of log-normal, exponential and TLAC distributions. 


Moreover, in finite sparse data that spans several orders of magnitude, which is the case of
CDDs when they are measured in seconds, it is very difficult to visualize the PDFs, since the
distribution is considerably noisy at its tail. One option is to smooth the data by aggregating
it into buckets, at the cost of a loss of information. Another option
is to move away from the PDF and analyze the cumulative distributions, i.e., the cumulative
distribution function (CDF) and the complementary cumulative distribution function (CCDF)
[Clauset et al., 2009].
These distributions veil the sparsity of the data and also the possible irregularities that may occur
for any particular reason. However, by using the CDF (CCDF) one ends up losing the information
in the tail (head) of the distribution. In order to escape from these drawbacks, we propose the use
of the Odds Ratio (OR) function, which is a cumulative function where we can clearly see the
distribution behavior both in the head and in the tail. This $OR(t)$ function is commonly used in
survival analysis and it measures the ratio between the number of individuals that have not
survived by time $t$ and the ones that survived. Its formula is given by:


$$OR(t) = \frac{\mathrm{CDF}_{TLAC}(t)}{1 - \mathrm{CDF}_{TLAC}(t)} \qquad (4.7)$$


Therefore, in Figure 4.17-b, we plot the OR function for the TLAC, the log-normal and
exponential distributions. The OR function of the exponential distribution is a power law until $t$
reaches the median, and then it grows exponentially. On the other hand, the OR function of the
log-normal grows slowly in the head and then fast in the tail. Finally, the OR function for the
TLAC is the most interesting one. When plotted in log-log scales, it is a straight line, i.e., it is a
power law. Thus, as shown in [Bennett, 1983], the $OR(t)$ function can be summarized by the
following linear regression model:


$$\ln(OR(t)) = \rho \ln(t) + \beta \qquad (4.8)$$

$$OR(t) = e^{\beta} t^{\rho} \qquad (4.9)$$

In our context, Equation 4.8 means that the ratio between the number of calls that will die by
time $t$ and the ones that will survive grows with a power of $\rho$. Moreover, given that the median
$\tilde{t}$ of the CDD is given when $OR(\tilde{t}) = 1$ and $OR(t) < 1$ when $t < \tilde{t}$, the probability of a call
to end grows with $t$ when $t < \tilde{t}$ and then decreases forever. We call this phenomenon the “lazy
contractor” effect, which represents the time a lazy contractor takes to complete a job. If the job
is easier and requires less effort than the ordinary regular job, he finishes it fast. However,
for jobs that are harder and that demand more work than the ordinary regular job, the contractor
also gets lazier and takes even more time to complete them, i.e., the longer a job is taking to be
completed, the longer it will take. The $\rho$ and the $\beta$ are the parameters of the TLAC model, with
$\rho = 1/\sigma$.
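
Because the model is a straight line in log-log scale, its two parameters can be estimated with an
ordinary least-squares fit on the empirical odds ratio. The sketch below illustrates that idea
only; it is not the Maximum Likelihood procedure used for the fits reported in this chapter, and
the function name and the rank/(n+1) plotting position are assumptions.

```python
import math

def fit_tlac_by_regression(durations):
    """Least-squares fit of ln(OR(t)) = rho * ln(t) + beta on the empirical CDF.

    Returns (rho, beta). Needs at least two distinct positive durations.
    """
    data = sorted(durations)
    n = len(data)
    xs, ys = [], []
    for rank, t in enumerate(data, start=1):
        if rank < n and data[rank] == t:
            continue  # for tied durations keep only the last (highest) rank
        cdf = rank / (n + 1.0)  # plotting position, strictly inside (0, 1)
        xs.append(math.log(t))
        ys.append(math.log(cdf / (1.0 - cdf)))
    if len(xs) < 2:
        raise ValueError("need at least two distinct durations")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    rho = sxy / sxx
    beta = mean_y - rho * mean_x
    return rho, beta
```

A regression on cumulative quantities gives every distinct duration the same weight, so it can
differ noticeably from a likelihood-based fit when the data are noisy in the head or the tail.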

We conclude this section and, therefore, the first part of the solution to Problem 1, by explaining
the intuition behind the parameters of the TLAC model. The parameter $\rho$ is the efficiency
coefficient, which measures how efficient the contractor is. The higher the $\rho$, the more efficient
the contractor is and the faster he will complete the job. On the other hand, the location parameter
$\beta$ is the weakness coefficient, which gives the duration $\tilde{t}$ of the typical regular job a contractor
with a given efficiency coefficient $\rho$ can take without being lazy, where $\tilde{t} = \exp(-\beta/\rho)$. This
means that the lower the $\beta$, the harder are the jobs that the contractor is used to handling.


Goodness of Fit 

In this section, we tackle the second requirement of Problem 1 by showing the goodness of fit
of our TLAC model. First, we show in Figure 4.18-a the PDF of the CDD for a highly talkative
user, with 3091 calls, and with the values put in buckets of 5 seconds to ease the visualization.
We also show the best fits using Maximum Likelihood Estimation (MLE) for the exponential
and the log-normal distributions and also for our proposed TLAC model. Visually, it is clear that
the best fits are the ones from the log-normal distribution and the TLAC distribution, with the
exponential distribution not being able to explain either the head or the tail of the CDD.

However, by examining the OR plot in Figure 4.18-b, we clearly see that the TLAC model provides
the best fit for the real data. As verified for the exponential distribution in the PDF, in the
OR case, the log-normal also could not explain either the head or the tail of the CDD. We also
point out that we can see relevant differences between the TLAC model and the real data only
for the first call durations, which happens because these regions represent only a very small
fraction of the data. The results shown in Figure 4.18 once more validate our proposal that the
TLAC is a good model for CDDs.




Figure 4.18: Comparison of models for the distribution of the phone call durations of a highly
talkative user, with 3091 calls. TLAC in red, log-normal in green and exponential
in black. Visually, for the PDF both the TLAC and the log-normal distribution
provide good fits to the CDD but, for the OR, the TLAC clearly provides the best fit.


Given our initial analysis, we may state that the TLAC seems to be a good fit for the CDDs and
also serves as an intuitive explanation for how the durations of the calls are generated. However,
in order to conclude our answer for Problem 1, we must verify its generality and also
compare it to the generality of the log-normal and exponential distributions. Thus, we verify which
one of the distributions can better fit the CDD of all the users of our dataset that have $n > 30$
phone calls. We calculated, for every user, the best fit according to the MLE for the TLAC, the
log-normal and exponential distributions and we performed a Kolmogorov-Smirnov goodness of
fit test [Massey, 1951], at the 5% significance level, to verify whether the user’s CDD follows any
of these distributions. From now on, every time we mention that a distribution was correctly
fitted, we mean that it passed the Kolmogorov-Smirnov goodness of fit test.
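
The acceptance criterion can be sketched as a one-sample Kolmogorov-Smirnov check against a
candidate CDF. This is an illustrative sketch under stated assumptions: the function names are
hypothetical, the constant 1.358 is the usual large-sample critical value for the 5% level, and no
correction is made for parameters estimated from the same data.

```python
import math

def tlac_cdf(t, rho, beta):
    """TLAC CDF written through the odds ratio OR(t) = exp(beta) * t**rho."""
    odds = math.exp(beta) * t ** rho
    return odds / (1.0 + odds)

def ks_statistic(durations, cdf):
    """Largest gap between the empirical CDF of the data and the model CDF."""
    data = sorted(durations)
    n = len(data)
    d = 0.0
    for i, t in enumerate(data):
        f = cdf(t)
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

def is_correctly_fitted(durations, cdf, c_alpha=1.358):
    """True if the KS statistic stays below the asymptotic 5% critical value."""
    n = len(durations)
    return ks_statistic(durations, cdf) <= c_alpha / math.sqrt(n)
```

A call such as `is_correctly_fitted(durations, lambda t: tlac_cdf(t, rho, beta))` would then play
the role of the "correctly fitted" label for one user and one candidate distribution.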

In Figure 4.19, we show the percentage of CDDs that could be fitted by a log-normal, a TLAC and
an exponential distribution. As we can see, the TLAC distribution can explain the highest fraction
of the CDDs and the exponential distribution, the lowest. We observe that the TLAC distribution
correctly fits almost 100% of the CDDs for users with $n < 1000$. From this point, the quality of
the fits starts to decay, but significantly later than for the log-normal distribution. We emphasize
that the great majority of users have $n < 1000$, which indicates that the CDDs of some of the more
talkative users are probably driven by non-natural activities, such as spam, telemarketing or other
strongly commercially driven intents. This result, allied to the fact that the TLAC distribution
could model more than 96% of the users, makes it reasonable to answer Problem 1 by claiming that
the TLAC distribution is the standard model for CDDs in our dataset.



Figure 4.19: Percentage of users’ CDDs that were correctly fitted vs. the user’s number of calls
$c$. The TLAC distribution is the one that provided better fits for the whole
population of customers with $c > 30$. It correctly fitted more than 96% of the
users, only significantly failing to fit users with $c > 10^3$, probably spammers,
telemarketers or other users with non-normal behavior.


Finally, we further explore Problem 1 by looking at the OR of the talkative users that were not
correctly fitted by the TLAC model. In Figure 4.20, we show the OR for three of these users
and, as we observe, even these customers have a visually good fit to the TLAC model. These
results further corroborate the generality of TLAC. Despite the fact that the
irregularities of these customers’ CDDs prevent them from being correctly fitted by the TLAC
model, it is clear that the TLAC can represent their CDDs significantly well.



Figure 4.20: Odds ratio vs. call duration, panels (a)-(c), of 3 talkative customers that were not
correctly fitted by TLAC.


4.3.3 TLAC Over Time


It is trivial to visualize the distribution of users according to a given summarized attribute,
such as number of phone calls per month or aggregate call duration. However, if we want to
visualize the distribution and evolution of a temporal feature of the user such as his CDD, things
start to get more complicated. Thus, in this section, we tackle the following problem:

Problem 2 EVOLUTION. Given the $\rho_i$ and $\beta_i$ parameters of $N$ customers ($i = 1, 2, \ldots, N$),
describe how they collectively evolve over time.

We propose two approaches to solve Problem 2. In the next two sections, we describe the 
MetaDist solution and the Focal Point approach, respectively. 


Group Behavior and Meta-Fitting 

Since we know that the great majority of users’ CDDs can be modeled by the TLAC model, in
order to solve Problem 2, we need to figure out how the users are distributed according to their
parameters $\rho_i$ and $\beta_i$ of the TLAC model. If the meta-distribution of the parameters $\rho_i$ and
$\beta_i$ is well defined, then we can model the collective call behavior of the users and see its
evolution over time. From now on, we will call the meta-distribution of the parameters $\rho_i$ and
$\beta_i$ the MetaDist distribution.

In Figure 4.21-a, we show the scatter plot of the parameters $\rho_i$ and $\beta_i$ of the CDD of each
user $i$ for the first month of our dataset. Due to overplotting we cannot observe any latent
pattern, but we can spot outliers. Moreover, by plotting the $\rho_i$ and $\beta_i$ parameters using
isocontours, as shown in Figure 4.21-b, we automatically smooth the visualization by disregarding
sparsely populated regions. While darker colors mean a higher concentration of pairs $\rho_i$ and
$\beta_i$, white means that there are no users with CDDs with these values of $\rho_i$ and $\beta_i$.


Figure 4.21: Scatter plot of the parameters $\rho_i$ and $\beta_i$ of the CDD of each user $i$ for the first
month of our dataset: (a) rough scatter plot, (b) isocontours of the real data, (c) bivariate
Gaussian fitting. In (a) we cannot see any particular pattern, but we can spot outliers.
By plotting the isocontours (b), we can observe how well a bivariate Gaussian (c)
fits the real distribution of the $\rho_i$ and $\beta_i$ of the CDDs (“meta-fitting”).

Surprisingly, we observe that the isocontours of Figure 4.21-b are very similar to the ones of
a bivariate Gaussian. In order to verify this, we extracted from the MetaDist distribution the
means $P$ and $B$ of the parameters $\rho_i$ and $\beta_i$, respectively, and also the covariance matrix
$\Sigma$. We use these values to generate the isocontours of a bivariate Gaussian distribution and
plot them in Figure 4.21-c. We observe that the isocontours of the generated bivariate Gaussian
distribution are similar to the ones from the MetaDist distribution, which indicates that both
distributions are also similar. Thus, we conjecture that a bivariate Gaussian distribution fits the
real distribution of the $\rho_i$ and $\beta_i$, making the MetaDist a good model to represent the
population of users with a given call duration behavior.
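
The meta-fitting step amounts to estimating a mean vector and a covariance matrix and evaluating
the corresponding Gaussian density. The following sketch shows this under stated assumptions: the
function names are hypothetical and the unbiased (n - 1) estimator is an arbitrary choice.

```python
import math

def meta_parameters(rhos, betas):
    """Means and 2x2 covariance matrix of the per-user (rho, beta) pairs."""
    n = len(rhos)
    mean_rho = sum(rhos) / n
    mean_beta = sum(betas) / n
    var_rho = sum((r - mean_rho) ** 2 for r in rhos) / (n - 1)
    var_beta = sum((b - mean_beta) ** 2 for b in betas) / (n - 1)
    cov = sum((r - mean_rho) * (b - mean_beta) for r, b in zip(rhos, betas)) / (n - 1)
    return (mean_rho, mean_beta), [[var_rho, cov], [cov, var_beta]]

def gaussian_density(rho, beta, mean, cov):
    """Bivariate Gaussian density; its level sets are the isocontours."""
    (a, b), (_, d) = cov
    det = a * d - b * b
    dx, dy = rho - mean[0], beta - mean[1]
    quad = (d * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))

def correlation(cov):
    """Pearson correlation implied by a 2x2 covariance matrix."""
    return cov[0][1] / math.sqrt(cov[0][0] * cov[1][1])

# First-month variances and covariance as listed in Table 4.4.
print(round(correlation([[0.095, -0.30], [-0.30, 1.24]]), 2))
```

With the first-month values of Table 4.4 the implied correlation is about -0.87, in line with the
strong negative correlation between the two parameters discussed in this chapter.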

Given that the MetaDist is a good model for the group behavior of the customers in our dataset,
we can now visualize and measure how they evolve over time. In Figure 4.22 we show the
evolution of the MetaDist over the four months of our dataset. The first observation we can make
is that the bivariate Gaussian shape holds during the whole analyzed period, which validates
the robustness of the MetaDist. Moreover, a first look indicates that the meta-parameters
also have not changed significantly over the months. This can be confirmed by the first 5 rows
of Table 4.4, which lists the values of the meta-parameters $P$, $B$ and $\Sigma$ ($\sigma^2_{\rho_i}$,
$\sigma^2_{\beta_i}$, $\mathrm{cov}(\rho_i, \beta_i)$) for the four analyzed months. This indicates that the phone
company has already reached a stable state with its customers concerning its prices, plans and
services. In fact, the only noticeable difference occurs between the first month and the others.
We observe that the meta-parameters of the first month have a slightly higher variance than the
others, which indicates that this is probably an atypical month for the residents of the country in
which our phone records were collected. But in spite of that, in general, the meta-parameters do
not change through time. Then, we can state the following observation:

Observation 4.3.1 TYPICAL BEHAVIOR. The typical human behavior is to have an efficiency
coefficient $\rho \approx 1.59$ and a weakness coefficient $\beta \approx -6.25$. Thus, the median duration for a
typical mobile phone user is 51 seconds and the mode is 20 seconds.
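
The two durations in this observation can be checked directly from the parameters. The sketch
below uses the median exp(-beta/rho) from the model definition and, as an assumption not stated in
the text, the standard mode formula of the untruncated log-logistic distribution with scale equal
to the median and shape rho (valid for rho > 1).

```python
import math

def tlac_median(rho, beta):
    """Duration at which the odds ratio equals 1."""
    return math.exp(-beta / rho)

def tlac_mode(rho, beta):
    """Mode of the untruncated log-logistic; only defined for rho > 1."""
    if rho <= 1.0:
        raise ValueError("the mode formula requires rho > 1")
    return tlac_median(rho, beta) * ((rho - 1.0) / (rho + 1.0)) ** (1.0 / rho)

rho, beta = 1.59, -6.25
print(round(tlac_median(rho, beta)), round(tlac_mode(rho, beta)))  # 51 20
```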



Figure 4.22: Evolution of the MetaDist over the four months of our dataset. Note that the
collective behavior of the customers is practically stable over time.


Focal Point 

An interesting observation we can derive from the MetaDist shown in Figure 4.21 is that there
exists a significant negative correlation between the parameters $\rho_i$ and $\beta_i$. This negative
correlation, more precisely of -0.86, leads us to the fact that the OR lines, i.e., the TLAC odds
ratio plots of the customers of our dataset, when plotted together, should cross over a given
region. In order to verify this, we plotted in Figure 4.23-a the OR lines for some customers of our
dataset. As we can observe, it appears that these lines are all crossing in the same region, where
the duration is approximately 20 seconds and the odds ratio approximately 0.1. Then, in Figure
4.23-b, we plotted together the OR lines of 20,000 randomly picked customers and derived from
them the isocontours to show the most populated areas. As we can observe, there is a highly
populated point where the duration is 17 seconds and the OR is 0.15. By analyzing the whole
month dataset, we verified that more than 50% of the users have OR lines that cross this point.
From now on, we call this point the Focal Point.



Figure 4.23: The TLAC lines of several customers plotted together: (a) direct plot, (b)
isocontours of the plot. We can observe that, given the negative correlation of the
parameters $\rho_i$ and $\beta_i$, the lines tend to cross in one point (a). We plot the
isocontours of the lines together and approximately 50% of the customers have TLAC
lines that pass through the high-density point (duration=17s, OR=0.15) (b).


Formally, the Focal Point is a point on the OR plot with two coordinates: a coordinate
$FP_{duration}$ on the duration axis and a coordinate $FP_{OR}$ on the OR axis. When a set of customers
have their OR plots crossing at a Focal Point with coordinates ($FP_{duration}$, $FP_{OR}$), it means
that for all these customers the $100 \cdot \frac{FP_{OR}}{1 + FP_{OR}}$-th percentile of their CDD is at
$FP_{duration}$ seconds. Thus, in the 2 bottom lines of Table 4.4, we describe the Focal Point
coordinates for the four months of our analysis and, surprisingly, the Focal Point is stationary.
Thus, we can make the following observation:

Observation 4.3.2 UNIVERSAL PERCENTILE. The vast majority of mobile phone users have
the same 10th percentile, which is at 17 seconds.
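
Because the odds ratio is CDF / (1 - CDF) (Equation 4.7), an OR coordinate can be converted back
into a percentile. The short sketch below does this for the Focal Point OR values listed in Table
4.4; the function name is hypothetical.

```python
def percentile_from_odds(odds):
    """Percentile (0-100) of the CDD that corresponds to a given odds ratio."""
    return 100.0 * odds / (1.0 + odds)

# Focal Point OR coordinate for each of the four months (Table 4.4).
for month, fp_or in enumerate([0.15, 0.12, 0.11, 0.11], start=1):
    print(month, round(percentile_from_odds(fp_or), 1))
```

The conversion gives roughly the 13th percentile for the first month and about the 10th to 11th
percentile for the remaining months.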

Observation 4.3.2 suggests that one of the risks for a call to end acts in the same way for everyone.
We conjecture that, given the 17 seconds duration, this is the risk of a call reaching the voice
mail of the destination’s mobile phone, i.e., the callee could not answer the call. The duration of
this call involves listening to the voice mail recording and leaving a message, which is consistent
with the 17 seconds mark. It would be interesting to empirically verify the percentage of phone
calls that reach the voice mail and compare it with the Focal Point result.


                           1st month   2nd month   3rd month   4th month
$P$                        1.59        1.58        1.59        1.59
$B$                        -6.16       -6.28       -6.32       -6.30
$\sigma^2_{\rho_i}$        0.095       0.086       0.084       0.083
$\sigma^2_{\beta_i}$       1.24        0.98        0.95        0.94
$cov(\rho_i, \beta_i)$     -0.30       -0.24       -0.24       -0.23
$FP_{duration}$ (s)        17          17          17          17
$FP_{OR}$                  0.15        0.12        0.11        0.11


Table 4.4: Evolution of the meta-parameters (rows 1-5) and the Focal Point (rows 6-7) during
the four months of our dataset.


4.3.4 TLAC at Work 

In the previous section, we showed that the collective behavior of millions of mobile phone users
is stationary over time. We described two approaches to do that, one based on the MetaDist
and the other based on the Focal Point. The initial conclusions of both approaches are the same.
First, the collective behavior of our dataset is stable, i.e., it does not change significantly over
time. Second, we could see a slight difference between the first month and the others, indicating
that this month is an atypical month in the year. We believe that these two approaches can
succinctly and accurately help mobile phone companies monitor the collective behavior of
their customers over time.

Moreover, since we could successfully model more than 96% of the CDDs as a TLAC, a natural
application of our models would be anomaly detection and user classification. A mobile
phone user whose CDD cannot be explained by the TLAC distribution is a potential
user to be observed, since his call behavior is distinct from that of the majority of the other users.
To illustrate this, we show in Figure 4.24 a talkative node with a CDD that cannot be modeled by
a TLAC distribution. We observe that this node, indeed, has an atypical behavior, with his CDD
being noisy from 10 to 100 seconds and also showing an impressive number of phone calls
with duration around 1 hour (or 5 x 700 seconds). Moreover, another way to spot outliers is
to check which users have a significant distance from the main cluster of the MetaDist. As we
showed in Figure 4.21-a, this can be easily done even visually.
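
One way to make "significant distance from the main cluster" operational is the Mahalanobis
distance under the bivariate Gaussian meta-fit. This specific criterion is an assumption of the
sketch below, not a method stated in the text; the function names and the 1% tail level are
likewise illustrative.

```python
import math

def mahalanobis_sq(point, mean, cov):
    """Squared Mahalanobis distance of a (rho, beta) pair from the cluster."""
    (a, b), (_, d) = cov
    det = a * d - b * b
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    return (d * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det

def is_outlier(point, mean, cov, alpha=0.01):
    """Flag points in the outer alpha tail of the fitted bivariate Gaussian.

    For two dimensions the squared distance is chi-square with 2 degrees of
    freedom, whose tail probability is exp(-x / 2).
    """
    threshold = -2.0 * math.log(alpha)
    return mahalanobis_sq(point, mean, cov) > threshold

# First-month meta-parameters from Table 4.4.
mean = (1.59, -6.16)
cov = [[0.095, -0.30], [-0.30, 1.24]]
print(is_outlier((1.89, -7.16), mean, cov))  # along the negative correlation
print(is_outlier((1.89, -5.16), mean, cov))  # against the negative correlation
```

The two test points are equally far from the mean in each coordinate, but only the one that goes
against the negative correlation between the parameters is flagged.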

Another application that emerges naturally for our models is the summarization of data. By
modeling the users’ CDDs as TLAC distributions, we are able to summarize, for each user $i$,
hundreds or thousands of phone calls into just two values, the parameters $\rho_i$ and $\beta_i$ of the
TLAC model. In our specific case, we could summarize over 0.1 TB of phone call data into less
than 80 MB of data. In this way, it is completely feasible to analyze several months, or even years,
of temporal phone call data and verify how the behavior of the users is evolving through time.
Also, all the proposed models in this work can be directly applied to the design of generators
that produce synthetic data, allowing researchers that do not have access to real data to generate
their own.


Figure 4.24: Outlier whose CDD cannot be modeled by the TLAC distribution.
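
A synthetic generator of call durations for one user can be sketched with inverse-CDF sampling:
solving OR(t) = u / (1 - u) for t with OR(t) = exp(beta) * t**rho. The function name, the use of
rejection to enforce the truncation at 1 second, and the seeding are assumptions of this sketch.

```python
import math
import random

def sample_tlac(rho, beta, n, rng=None, min_duration=1.0):
    """Draw n call durations (in seconds) from a TLAC with parameters rho, beta."""
    rng = rng or random.Random()
    samples = []
    while len(samples) < n:
        u = rng.random()
        if u <= 0.0:
            continue  # avoid log(0)
        t = math.exp((math.log(u / (1.0 - u)) - beta) / rho)
        if t >= min_duration:  # truncation: discard durations in [0, 1)
            samples.append(t)
    return samples

durations = sorted(sample_tlac(1.59, -6.25, 20000, rng=random.Random(0)))
print(durations[len(durations) // 2])  # sample median, close to exp(-beta / rho)
```

Per-user parameters could in turn be drawn from the bivariate Gaussian MetaDist, so that a whole
synthetic population is described by five meta-parameters.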


4.3.5 Discussion 

Generality of TLAC 

As we mentioned earlier, one of the major strengths of the TLAC model is its generalization
power. We showed that even for distributions that oscillate between log-normal and log-logistic,
or that have irregular spikes that prevent an exact TLAC fit, TLAC can still represent
them well. Besides this, the simplicity of the TLAC model allows us to directly
understand how its form changes with its parameters and to verify its boundaries. For instance, in
the case of the CDD, $e^{\beta}$ gives the odds ratio when the duration is 1 second. Thus, when $e^{\beta} > 1$,
most of the calls have a duration lower than 1 second, which makes the CDD converge to a
power law, i.e., the initial spike is truncated. Moreover, as $\rho \to 0$, the odds ratio tends to the
constant $e^{\beta}$, which causes the variance to be infinite. By observing Figure 4.25 and considering
human calling behavior, we conjecture that $\beta$ is upper bounded by 1 and $\rho$ is lower bounded by
0.5. These values are coherent with the global intuition on human calling behavior.
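
The two boundary statements above can be checked numerically. The sketch below assumes that the odds ratio has the form OR(x) = e^β · x^ρ. This form is consistent with both statements (OR(1) = e^β, and OR(x) tends to e^β as ρ tends to 0), but it is not restated in this section, and the function names are illustrative only.

```python
import math


def tlac_odds_ratio(duration: float, rho: float, beta: float) -> float:
    """Odds ratio P(X <= x) / P(X > x), under the ASSUMED form e^beta * x^rho."""
    return math.exp(beta) * duration ** rho


def tlac_cdf(duration: float, rho: float, beta: float) -> float:
    """CDF recovered from the odds ratio: OR / (1 + OR)."""
    odds = tlac_odds_ratio(duration, rho, beta)
    return odds / (1.0 + odds)


# At a duration of 1 second the odds ratio is e^beta, whatever rho is.
odds_at_one = tlac_odds_ratio(1.0, rho=0.8, beta=-3.0)

# With e^beta > 1 (beta > 0), more than half of the calls last at most 1 second.
share_within_one_second = tlac_cdf(1.0, rho=0.8, beta=0.5)

# With rho = 0 the odds ratio no longer depends on the duration.
flat_odds = [tlac_odds_ratio(x, rho=0.0, beta=-3.0) for x in (1.0, 60.0, 3600.0)]
```

The parameter values used here (ρ = 0.8, β = -3.0 or 0.5) are arbitrary and are not fitted to any data set.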


Additional Correlations 

Given that the vast majority of users’ CDDs can be represented by the TLAC model, it would
be interesting if we could predict their parameters $\rho_i$ and $\beta_i$ based on one of their summarized
attributes. One could imagine that a user who makes a large number of phone calls per month
might have a different CDD from a user who makes only a few.

Figure 4.25: Cumulative distributions for $\rho$ and $\beta$: (a) CDF for $\rho$; (b) CCDF for $\beta$. We can
observe that $\rho$ is lower bounded by 0.5 and $\beta$ is upper bounded by 1. These values are
coherent with the global intuition on human calling behavior.

Moreover, we could also think
that a user who has many friends and talks to them on the phone regularly may also have a different
CDD from a user who only talks to his family on the phone. In Figures 4.26 and 4.27, we show,
respectively, the isocontours of the behavior of the $\rho_i$ and $\beta_i$ parameters for users with different
values of number of phone calls $n_i$, aggregate duration $w_i$, and number of partners $p_i$, i.e., the
number of distinct persons that the user called in a month. With the exception of
$\rho_i$ against $w_i$, we observe that the variance decreases as the value of the summarized attribute
increases. This suggests that the CDD of users who call often or talk for long, as well as of users with many
partners, is easier to predict. Moreover, as we can observe in the figures and also in Table 4.5,
there is no significant correlation between the TLAC parameters and the summarized attributes
of the users. Thus, we make the following observation:

Observation 4.3.3 INVARIANT BEHAVIOR. The $\rho_i$ and $\beta_i$ parameters of user i behave as
invariant with respect to (a) number of phone calls $n_i$, (b) aggregate duration $w_i$ and (c) number
of partners $p_i$.


Attribute                Correlation with $\rho$    Correlation with $\beta$
---------------------    ----------------------    -----------------------
number of phone calls     0.14                      -0.18
aggregate duration       -0.21                       0.01
number of partners        0.18                      -0.18


Table 4.5: Correlations between summarized attributes and $\rho$ and $\beta$.


Finally, since there is no significant correlation between the users’ CDD parameters $\rho_i$ and
$\beta_i$ and their summarized attributes, we emphasize that these parameters should be considered when
characterizing user behavior in phone call networks. Moreover, besides characterizing individual
customers, the TLAC model can also be directly applied to the relationship between users,
analyzing how two persons call each other. One could use, for instance, the $\rho$ parameter as the
weight of the edges of the social network generated from phone call records.

Figure 4.26: Isocontours of the users’ CDD efficiency coefficient $\rho$ and their summarized
attributes: (a) number of phone calls; (b) aggregate duration; (c) number of friends.

Figure 4.27: Isocontours of the users’ CDD weakness coefficient $\beta$ and their summarized
attributes: (a) number of phone calls; (b) aggregate duration; (c) number of friends.


4.4 Summary of contributions


In this chapter we presented our work on the analysis of billions of human communication (phone
call, SMS, and instant message) records of millions of users over months of activity. Our
research questions focus on understanding human communication patterns. This body of work can
be grouped into three main tracks: (1) study of social circles; (2) study of reciprocal relations;
and (3) study of call durations.

In particular, our main contributions are summarized as follows:

• Social circles and related patterns: We studied the maximal cliques in our communication
networks and discovered new power-law-like patterns. More specifically, we found that
the number of maximal cliques a node participates in follows a power-law relation with its
degree; this translates to “popularity” growing superlinearly with the number of contacts.
As with the degree distribution, many nodes belong to only a few
cliques, while only a few nodes belong to many cliques. We also found that
the weights on the edges of triangles follow power-laws. Moreover, the discovered patterns
are stable and persistent over time.

• Reciprocity and 3PL model: We found that the joint distribution $\Pr(w_{ij}, w_{ji})$ of the weights
on mutual edges follows a bivariate pattern for all three types of weights: number of phone
calls, duration of phone calls and number of SMSs. We proposed the Triple Power Law
(3PL) function to model this distribution. Our goodness of fit tests showed that 3PL provides
better fits than two other well-known bivariate distributions for skewed data, the
Bivariate Pareto and the Bivariate Yule.

In addition, we took a weighted approach to quantify the degree of reciprocity. We observed
that reciprocity is higher (1) for mutual pairs with larger local network overlap,
that is, for people with more common friends; and (2) for mutual pairs with larger
degree-similarity, that is, for people with similar numbers of contacts.

• Call durations and TLAC model: We proposed the TLAC distribution, which fits
the vast majority of individual phone call durations very well, much better than log-normal and
exponential. We introduced MetaDist, which shows that the collection of TLAC parameters
follows a striking bivariate Gaussian; MetaDist also remains the same over time, with very
small fluctuations.

Part II

Generative Models of Networks



Chapter 5 
Preliminaries 

5.1 Introduction 


How can we build a model that would generate graphs that look like real data? The goal here is 
to develop a graph generation scheme such that the output graph obeys as many of the patterns 
observed in real-world graphs as possible. 

As we discussed in previous chapters, many fascinating properties of real-world graphs have been 
discovered, such as small and shrinking diameter [Albert et al., 1999; Leskovec et al., 2005b], 
as well as numerous power-laws [Akoglu et al., 2008; Chakrabarti et al., 2004b; Faloutsos et al., 
1999; Kleinberg et al., 1999; Leskovec et al., 2005b; Mcglohon et al., 2008; Newman, 2005; 
Siganos et al., 2003; Tsourakakis, 2008]. Given these patterns, and for several other reasons
which we discuss next, a natural question to ask is how to find a model that would
produce synthetic but realistic graphs. There are several applications
and advantages of modeling real-world graphs:

• Simulation studies: if we want to run tests for, say, a spam detection algorithm, and want to
observe how the algorithm behaves on graphs with different sizes and structural properties,
we can use graph generators to produce such graphs by changing the parameters. This is
also useful when it is difficult to collect any kind of real data.

• Sampling/Extrapolation: we can generate a smaller graph, for example for visualization
purposes or in case the original graph is too big to run tests on; or conversely generate
a larger graph, for instance to make future predictions and answer what-if questions.

• Summarization/Compression: model parameters can be used to summarize and compress
a given graph as well as to measure similarity to other graphs.

• Motivation to understand pattern generating processes: graph generators give intuition
and shed light upon what kind of processes can (or cannot) yield the emergence of certain
patterns. Moreover, modeling addresses the question of what patterns real networks exhibit
that need to be matched and provides motivation to figure out such properties.

Ideally, a desired graph generator should be:

1. realistic: it would produce graphs that obey all the discovered “laws” of real-world graphs
with appropriate values.

2. simple: it would be easy to understand and it would intuitively lead to the emergence of
macroscopic patterns.

3. parsimonious: it would require only a few parameters.

4. flexible: it would be able to generate the cross-product of weighted/unweighted,
directed/undirected and unipartite/bipartite graphs.

5. fast: the generation process would ideally take linear time with respect to the number of
edges in the output graph.

Earlier work on real-world networks focused on static snapshots and their structural properties.
As a result, earlier graph generators that try to mimic real graphs are limited to capturing only
these structural properties. Even then, those generators focus on modeling a single property
of networks, and fail to mimic others. For example, the ‘small-world’ model [Watts and
Strogatz, 1998] tries to capture the ‘small diameter’ phenomenon, while the ‘preferential attachment’
model [Barabasi and Albert, 1999] seeks to generate skewed degree distributions.

In the first chapter of this part, we focus on building generators that capture dynamic and weighted,
as well as static structural properties. We first introduce the Recursive Tensor Model (RTM) that
generates weighted, time-evolving graphs. RTM is a generalization of Kronecker graph
generators [Leskovec et al., 2005a] to the time-evolving setting. While it is mathematically tractable, it
inherits some disadvantages of Kronecker models; namely, multinomial/lognormal (instead
of power-law) distributions and a fixed number of nodes (see Section 5.2 for details). Next,
we introduce the Random Typing Graphs (RTG) model, which uses a process of ‘random typing’ to
generate source and destination node identifiers. It overcomes the two shortcomings of RTM and it
meets all the above desired properties. In fact, we show that it can generate graphs that obey all
eleven patterns that real graphs were observed to typically exhibit to date.

In the second chapter of this part, we focus on building a generator that captures human
communication patterns. We introduce the Pay and Call (PaC) model, which is a utility-driven generator
that models the way in which humans decide when and whom to contact. Our guiding principle
is that humans balance a trade-off between the cost of the communication (in time and money),
and its benefit (in valuable information and emotional support).

The following two chapters in this part are based on work as cited below. 

• Chapter 6 Models of graph topology [Akoglu and Faloutsos, 2009; Akoglu et al., 2008] 

• Chapter 7 Model of human communications [Du et al., 2009]. 


5.2 Related Work 


Modeling real-world graphs successfully is an elusive task. A vast majority of earlier graph
generators have focused on modeling a small number of common properties, but fail to mimic
others. The very first generative model was proposed by Erdős and Rényi [Erdos and Renyi, 1960].
The model begins with a fixed number of nodes, and adds edges, where every pair of nodes has the
same, independent probability of being linked by an edge. While it has some interesting provable
properties, it fails to produce a number of realistic properties, most notably the heavy-tailed
degree distribution. Another striking generator is the ‘preferential attachment’ model [Barabasi
and Albert, 1999], where at each time step nodes are added and ‘prefer’ to link to high-degree
nodes. This in turn leads to small diameter and heavy-tailed degree distributions; however, this
and related models lack the shrinking diameter property. There exists a group of other
generators such as the ‘small-world’ [Watts and Strogatz, 1998], ‘winners don’t take all’ [Pennock
et al., 2002], ‘forest fire’ [Leskovec et al., 2005b], and ‘butterfly’ [Mcglohon et al., 2008] models.
In addition, recursive models using Kronecker multiplication have proved useful for generating
self-similar properties of graphs [Leskovec et al., 2005a]. [Chakrabarti and Faloutsos, 2006]
provides a detailed survey on graph generators. In general, these methods model
some static graph properties while neglecting others as well as dynamic properties, or they
cannot be generalized to produce weighted graphs.

Kronecker graph generators [Leskovec et al., 2005a] are successful in the sense that they match
several of the properties of real graphs and they have proved useful for generating self-similar
properties of graphs. However, they have two disadvantages. The first is that they generate
multinomial/lognormal distributions for their degree and eigenvalue distributions, instead of
power-law ones. The second is that it is not easy to grow the graph incrementally:
they have a fixed, predetermined number of nodes (say, $N^k$, where N is the number of nodes of
the generator graph, and k is the number of iterations), so adding more edges than expected
does not create additional nodes.

Random dot product graphs [Kraetzl M., 2005; Young and Scheinerman, 2007] assign each vertex
a random vector in some d-dimensional space, and an edge is put between two vertices with
probability equal to the dot product of the vectors of its endpoints. This model does not generate weighted
graphs and by definition only produces undirected graphs. It also seems to require the computation
of the dot product for each pair of nodes, which takes quadratic time.

A different family of models, often referred to as games of network formation and mainly
from the fields of economics and game theory, is utility-based. In such models, there exist
agents that try to optimize a predefined utility function and the network structure takes shape from 
their collective strategic behavior. [Laoutaris et al., 2008] proposes a network formation game, 
where links have costs and lengths, and players have preference weights on the other players, 
to study the properties of pure Nash equilibria [Nash, 1951] in different settings. [Albers et al., 
2006], [Demaine et al., 2009], and [Fabrikant et al., 2003] study a similar game where players 
do not have fixed budgets and the cost function is defined in terms of the sum of the number of 
edges. [Even-Dar et al., 2007] proposes a network creation game where nodes act as buyers and 
sellers such that the resulting graphs are bipartite. This class of models, however, is usually hard 
to analyze. 

In this body of work, in contrast to previous work, we explicitly focus on generators for
time-evolving as well as weighted graphs. Moreover, our models are often mathematically tractable.

Chapter 6

Models of network topology


Problem Statement: How could we have a realistic generative model that will produce
synthetic graphs that look like real ones, i.e. graphs that obey all the patterns we know so far, as well
as the newly discovered ones for dynamic and weighted graphs?


In this chapter, we present two generative models: (1) Recursive Tensor Model, which uses 
Kronecker tensor product to produce time-evolving weighted graphs, and (2) Random Typing 
Graphs, based on a simple random typing procedure that mimics human behavior well. 


6.1 RTM: Recursive Tensor Model 


The high-level idea behind our Recursive Tensor Model is to use recursion in conjunction with
tensors (n-dimensional extensions of matrices). Recursion and self-similarity naturally lead to
modular network behavior (“communities-within-communities”), power laws [Leskovec et al.,
2005a], as well as bursty traffic [Wang et al., 2002]. Earlier work used self-similarity to generate
static snapshots of unweighted graphs [Chakrabarti et al., 2004b]. Here, we show how to build a
generator that will also match dynamic and weighted properties.

The main idea is to use recursion not only on the adjacency matrix, but also on the time
dimension. Specifically, we start with a small tensor $\mathcal{I}$ that has 3 sides (‘modes’): (a) senders,
(b) recipients and (c) time. We call the time-evolving graph represented by a tensor a ‘t-graph’
(see Fig. 6.1 (a-b)). Then, we recursively substitute every cell $(i,j,t)$ of the original
tensor $\mathcal{I}$ with a copy of the tensor itself, multiplied by the value of that cell (see Fig. 6.1 (c) for illustration
and Definition 2 for full details). Thanks to the self-similarity of the construct, we expect the
resulting tensor to have all the properties we want.

First, we give the details of the construction. Secondly, we provide proofs to show that our model
will generate graphs with desired properties. Finally, we give our experimental results.

Figure 6.1: (a) An example of the initial tensor $\mathcal{I}$ of size (4 x 4 x 3), shown as t-slices. The ‘t-slices’ represent
the changes on the adjacency matrix at every other time step. (b) The corresponding
t-graph represented by the tensor in part (a). It changes according to the ‘t-slices’ over
time. (c) Kronecker product of $\mathcal{I}$ by itself produces a self-similar tensor.


6.1.1 Model Description 

Throughout this section, we use the following conventions: uppercase bold letters for matrices,
lowercase bold letters for vectors, lowercase letters for scalars, and calligraphic uppercase letters
for tensors. The symbols used are listed in Table 6.1.


Symbol                                     Description
---------------------------------------    ----------------------------------------------------------------
$\mathcal{A}, \mathcal{B}, \mathcal{C}$    Tensors used to illustrate recursive tensor product
$a_{i,j,t}$                                Entry of a tensor
$\mathcal{I}$                              Initial tensor in RTM model
$\mathcal{G}_{\mathcal{A}}$                t-graph (time-evolving graph) represented by tensor $\mathcal{A}$
$\mathbf{D}_t$                             $t$-th slice of final tensor $\mathcal{D}$ in RTM
$s_t$                                      Total weight of $\mathbf{D}_t$
$e_t$                                      Number of edges of $\mathbf{D}_t$
$W_{\mathcal{D}}$                          Total weight of a tensor $\mathcal{D}$, or $\sum_t s_t$
$\mathbf{s}_{\mathcal{D},r}$               Temporal profile of $\mathcal{D}$ at resolution r
$\mathbf{p}_{\mathcal{D},r}$               Normalized temporal profile of $\mathcal{D}$ at resolution r


Table 6.1: Table of symbols used in RTM notation.


For the construction, we choose an initial $(N \times N \times \tau)$ tensor $\mathcal{I}$ with nonzero cells $(i,j,t)$
indicating an edge from node i to node j at time tick t. We initialize the cells so that the initial
t-graph (t- for time-evolving) $\mathcal{G}_{\mathcal{I}}$ represented by $\mathcal{I}$ looks like a miniature real-world graph. We
provide details on how to initialize $\mathcal{I}$ at the end of this section.

Note that RTM works for both directed and bipartite t-graphs. A bipartite t-graph can be
represented by an $(N \times M \times \tau)$ tensor. For simplicity, we focus on unipartite graphs in our work.

We propose to use Recursive Tensor Multiplication to produce a time-evolving graph. Our
method extends the Kronecker product¹ of two matrices by adding a third ‘mode’. The Kronecker
product of two matrices is defined as follows: Given two matrices $\mathbf{A}$ and $\mathbf{B}$ of sizes $(N \times M)$ and
$(N' \times M')$, respectively, the Kronecker product of $\mathbf{A}$ and $\mathbf{B}$, namely matrix $\mathbf{C}$ of dimension
$(N * N') \times (M * M')$, is given by

\[
\mathbf{C} = \mathbf{A} \otimes \mathbf{B} =
\begin{pmatrix}
a_{1,1}\mathbf{B} & a_{1,2}\mathbf{B} & \cdots & a_{1,M}\mathbf{B} \\
a_{2,1}\mathbf{B} & a_{2,2}\mathbf{B} & \cdots & a_{2,M}\mathbf{B} \\
\vdots & \vdots & \ddots & \vdots \\
a_{N,1}\mathbf{B} & a_{N,2}\mathbf{B} & \cdots & a_{N,M}\mathbf{B}
\end{pmatrix}
\]
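
As a small illustration of the matrix Kronecker product, the sketch below uses NumPy's `np.kron`. The two 2 x 2 matrices are arbitrary toy values chosen for this example and do not come from the text.

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[0, 1],
              [1, 0]])

# np.kron replaces every entry a[i, j] of A with the block a[i, j] * B,
# so the result has shape (2 * 2) x (2 * 2).
C = np.kron(A, B)

# The block in block-row 1, block-column 0 is a[1, 0] * B = 3 * B.
block = C[2:4, 0:2]
```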


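Definition 2 (Recursive Tensor Multiplication) Given two tensors $\mathcal{A}$ of size $(N \times M \times \tau)$ and $\mathcal{B}$
of size $(N' \times M' \times \tau')$, the Recursive Tensor Multiplication $\mathcal{C}$ of $\mathcal{A}$ and $\mathcal{B}$ is obtained by replacing each
cell $a_{i,j,t}$ of tensor $\mathcal{A}$ with $a_{i,j,t} * \mathcal{B}$. The resulting tensor $\mathcal{C}$ is of size $(N * N') \times (M * M') \times (\tau * \tau')$
such that

\[
c_{(i-1)N'+i',\;(j-1)M'+j',\;(t-1)\tau'+t'} = a_{i,j,t} * b_{i',j',t'}
\]

for $1 \le i \le N$, $1 \le j \le M$, $1 \le t \le \tau$ and $1 \le i' \le N'$, $1 \le j' \le M'$, $1 \le t' \le \tau'$.

An example of the Recursive Tensor Multiplication of a (3 x 3 x 3) tensor by itself is given in
Fig. 6.1 (c).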

To generate a growing graph over time, we take the ‘Recursive Tensor Multiplication’ of the initial
$(N \times N \times \tau)$ tensor $\mathcal{I}$ by itself k times as:

\[
\mathcal{I}^{k} = \mathcal{D} = \underbrace{\mathcal{I} \otimes \mathcal{I} \otimes \cdots \otimes \mathcal{I}}_{k \text{ times}}
\]

and then we take the final tensor $\mathcal{D}$ to represent our data. The data spans $\tau^k$ time
ticks with $N^k$ nodes. At every time step $t$ ($t \in \{1, 2, \ldots, \tau^k\}$), we get the t-slice (see Definition 3
below) $\mathbf{D}_t$ of $\mathcal{D}$, and for each nonzero cell $(i,j)$ of $\mathbf{D}_t$, we add an edge between node i and node j
with weight $a_{i,j}$. If the edge already exists, we increase the weight $w_{i,j}$ by the same amount.
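
The generation procedure described above can be sketched as follows. This is a minimal illustration, not the implementation used for the experiments: it relies on the fact that `np.kron` applied to 3-D arrays multiplies every cell of the first array by a full copy of the second, which matches Definition 2. The function names and the toy (2 x 2 x 2) initial tensor are assumptions made for this example.

```python
from collections import defaultdict

import numpy as np


def recursive_tensor_power(initial: np.ndarray, k: int) -> np.ndarray:
    """Multiply a 3-D tensor by itself so that the result has k factors."""
    result = initial
    for _ in range(k - 1):
        result = np.kron(result, initial)
    return result


def accumulate_edges(final: np.ndarray) -> dict:
    """Replay the t-slices in time order, summing the weights of repeated edges."""
    weights = defaultdict(float)
    for t in range(final.shape[2]):
        rows, cols = np.nonzero(final[:, :, t])
        for i, j in zip(rows, cols):
            weights[(int(i), int(j))] += float(final[i, j, t])
    return dict(weights)


# Toy initial tensor: edge 0 -> 1 with weight 1 at tick 0, edge 1 -> 0 with weight 2 at tick 1.
initial = np.zeros((2, 2, 2))
initial[0, 1, 0] = 1.0
initial[1, 0, 1] = 2.0

final = recursive_tensor_power(initial, k=3)   # shape (2^3, 2^3, 2^3)
edge_weights = accumulate_edges(final)
```

With k = 3 the final tensor spans 2^3 = 8 time ticks over 2^3 = 8 nodes, and its total weight equals the total weight of the initial tensor raised to the power k.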


Initializing $\mathcal{I}$

In order to take advantage of the self-similarity property of our construct, we want the initial
graph $\mathcal{G}_{\mathcal{I}}$ to be a realistic graph itself. Basically, one can use any graph generator in the literature
that is known to produce realistic graphs [Erdos and Renyi, 1960; Leskovec et al., 2005b; Watts
and Strogatz, 1998] to generate $\mathcal{G}_{\mathcal{I}}$.

To our knowledge, RTM is the first weighted graph generator; thus we also take weights into
consideration as follows. We force the initial graph to obey the WPL at all time ticks. That is,
when a link occurs between two nodes, we put weight on the edge, so that the number of edges E(t)
and the total weight W(t) over time follow a power law with a user-specified exponent $\alpha$.

¹ Unfortunately, the Kronecker product C of two matrices A and B is also called Kronecker Tensor multiplication,
even though A, B, C are matrices. To disambiguate, we use the name RTM where A, B, C are in fact tensors.

Having given the details of the construction, we next give proofs for the characteristics that RTM
will generate. Before that, we define the terms we use throughout the proofs.

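Definition 3 (t-slice of a tensor $\mathcal{T}$) Given a tensor $\mathcal{T}$ of size $(N \times M \times \tau)$, the t-slice of $\mathcal{T}$ is a
matrix $\mathbf{T}_t$ such that

\[
\mathbf{T}_t = \mathcal{T}(i, j, t), \quad \forall i, \forall j, \; 1 \le i \le N, \; 1 \le j \le M
\]

Definition 4 ((Normalized) temporal (t-) profile of $\mathcal{T}$) Given a tensor $\mathcal{T}$ of size $(N \times M \times \tau)$,
let $s_t$ denote the total weight of its t-slice. Then, the t-profile of $\mathcal{T}$ is a $(1 \times \tau)$ vector, such that

\[
\mathbf{s}_{\mathcal{T},0} = (s_1, s_2, \ldots, s_\tau)
\]

Total weight $W_{\mathcal{T}}$ of $\mathcal{T}$ can be written as $\sum_{t=1}^{\tau} s_t$. Then, the normalized t-profile of $\mathcal{T}$ is a $(1 \times \tau)$
vector, such that

\[
\mathbf{p}_{\mathcal{T},0} = \left( \frac{s_1}{W_{\mathcal{T}}}, \frac{s_2}{W_{\mathcal{T}}}, \ldots, \frac{s_\tau}{W_{\mathcal{T}}} \right)
\]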
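
Definitions 3 and 4 translate directly into array operations. The sketch below stores a tensor as a NumPy array with axes (sender, recipient, time); the function names and the toy (2 x 2 x 3) tensor are illustrative assumptions.

```python
import numpy as np


def t_slice(tensor: np.ndarray, t: int) -> np.ndarray:
    """Return the t-slice (1-based t, as in Definition 3) as an N x M matrix."""
    return tensor[:, :, t - 1]


def t_profile(tensor: np.ndarray) -> np.ndarray:
    """Vector (s_1, ..., s_tau): total weight of each t-slice."""
    return tensor.sum(axis=(0, 1))


def normalized_t_profile(tensor: np.ndarray) -> np.ndarray:
    """t-profile divided by the total weight of the tensor."""
    profile = t_profile(tensor)
    return profile / profile.sum()


# Toy (2 x 2 x 3) tensor with one weighted edge per time tick.
toy = np.zeros((2, 2, 3))
toy[0, 1, 0] = 1.0
toy[1, 0, 1] = 3.0
toy[0, 0, 2] = 4.0

profile = t_profile(toy)                 # slice weights 1, 3, 4
normalized = normalized_t_profile(toy)   # 1/8, 3/8, 4/8
```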


6.1.2 Theorems and Proofs 

Recursive Tensor graphs can be shown to exhibit several real-world graph properties. In particular,
if we choose the initial graph to be a miniature of a real graph, then after recursive iterations
of RTM, the resulting graph will exhibit properties similar to those of the initial graph due to the
self-similarity of the construction.

Theorem 1 (Self-similar and Bursty Edge/Weight Additions) Let edge/weight additions for
$\mathcal{I}$ with $\mathbf{p}_{\mathcal{I},0}$ be self-similar and bursty, for which the slope of the entropy plot is

\[
\text{slope} = H(\mathbf{p}_{\mathcal{I},0}) = -\sum_{i=1}^{\tau} \mathbf{p}_{\mathcal{I},0}(i) \log_2\left(\mathbf{p}_{\mathcal{I},0}(i)\right).
\]

After k iterations of RTM, edge/weight arrivals over time for $\mathcal{D}$ are also self-similar and bursty.
The slope of the entropy plot over all aggregation levels r of $\mathcal{D}$ is equal to

\[
\text{slope} = H(\mathbf{p}_{\mathcal{D},r}) = H(\mathbf{p}_{\mathcal{I},0}), \quad \forall r
\]

where $H(\mathbf{p}_{\mathcal{D},r})$ is the slope of the entropy plot at aggregation level r. Furthermore, the slope
does not change with the value of k, that is, burstiness is independent of scale.

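Proof We will prove the claim for weight additions; similar arguments apply for edge additions.

After k iterations of RTM, the total weight of $\mathcal{D}$ becomes

\[
W_{\mathcal{D}} = W_{\mathcal{I}}^{k} = (s_1 + s_2 + \cdots + s_\tau)^{k}
\]

At aggregation level (resolution) 1, we group slices by $\tau^{k-1}$ into $\tau$ groups. Then, $W_{\mathcal{D}}$ can be
written as

\[
W_{\mathcal{D}} = s_1 * W_{\mathcal{I}}^{k-1} + s_2 * W_{\mathcal{I}}^{k-1} + \cdots + s_\tau * W_{\mathcal{I}}^{k-1}
\]

Then for $1 \le t \le \tau$,

\[
\mathbf{p}_{\mathcal{D},1}(t) = \frac{s_t * W_{\mathcal{I}}^{k-1}}{W_{\mathcal{I}}^{k}} = \frac{s_t}{W_{\mathcal{I}}} = \mathbf{p}_{\mathcal{I},0}(t)
\]

At aggregation level 2, we group slices by $\tau^{k-2}$ into $\tau^2$ groups. Then,

\[
\begin{aligned}
W_{\mathcal{D}} ={} & s_1 * (s_1 * W_{\mathcal{I}}^{k-2} + s_2 * W_{\mathcal{I}}^{k-2} + \cdots + s_\tau * W_{\mathcal{I}}^{k-2}) \\
{}+{} & s_2 * (s_1 * W_{\mathcal{I}}^{k-2} + s_2 * W_{\mathcal{I}}^{k-2} + \cdots + s_\tau * W_{\mathcal{I}}^{k-2}) \\
{}+{} & \cdots \\
{}+{} & s_\tau * (s_1 * W_{\mathcal{I}}^{k-2} + s_2 * W_{\mathcal{I}}^{k-2} + \cdots + s_\tau * W_{\mathcal{I}}^{k-2})
\end{aligned}
\]

For any slice i at level 1 and t at level 2, $1 \le t \le \tau^2$,

\[
\mathbf{p}_{\mathcal{D},2}(t) = \frac{s_i * s_t * W_{\mathcal{I}}^{k-2}}{s_i * W_{\mathcal{I}}^{k-1}} = \frac{s_t}{W_{\mathcal{I}}} = \mathbf{p}_{\mathcal{I},0}(t)
\]

Finally, at aggregation level k, we group slices by $\tau^0$ into $\tau^k$ groups as

\[
\begin{aligned}
W_{\mathcal{D}} ={} & (s_1)^{k-1} * (s_1 + s_2 + \cdots + s_\tau) \\
{}+{} & (s_1)^{k-2} * s_2 * (s_1 + s_2 + \cdots + s_\tau) \\
{}+{} & \cdots \\
{}+{} & (s_\tau)^{k-1} * (s_1 + s_2 + \cdots + s_\tau)
\end{aligned}
\]

For all combinations of $(k-1)$ slices at levels from 1 to $(k-1)$, let $c_j$ denote the corresponding
coefficients, $1 \le j \le \tau^{k-1}$, and for any slice t at level k, $1 \le t \le \tau^k$,

\[
\mathbf{p}_{\mathcal{D},k}(t) = \frac{c_j * s_t}{c_j * W_{\mathcal{I}}} = \frac{s_t}{W_{\mathcal{I}}} = \mathbf{p}_{\mathcal{I},0}(t)
\]

We showed that the normalized t-profile $\mathbf{p}_{\mathcal{D},r}$ of $\mathcal{D}$ remains the same at all aggregation levels r and
is equal to that of the initial tensor $\mathcal{I}$. So, we conclude that the bias of burstiness for $\mathcal{D}$ is the same
as that of $\mathcal{I}$ for all values of k, since $H(\mathbf{p}_{\mathcal{D},r})$ would not change for $\forall r$. ∎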

As an example, starting with a (10 x 10 x 2) tensor $\mathcal{I}$, where $\mathbf{p}_{\mathcal{I},0}(1) : \mathbf{p}_{\mathcal{I},0}(2) = 0.175 : 0.825$, after
k = 3 iterations the slope of the entropy plot is found to be 0.669, which is equal to

\[
H(\mathbf{p}_{\mathcal{I},0}) = -\left( .175 * \log_2(.175) + .825 * \log_2(.825) \right)
\]

See Fig. 6.3 (c).
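
The value 0.669 and the level-1 step of the proof can be reproduced with a few lines of NumPy. The sketch below uses a hypothetical (2 x 2 x 2) initial tensor rather than the (10 x 10 x 2) one of the example, since only the normalized t-profile 0.175 : 0.825 matters for the entropy; the helper names are assumptions.

```python
import numpy as np


def entropy_bits(p) -> float:
    """Shannon entropy (base 2) of a probability vector; zero entries are skipped."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))


def level1_profile(final: np.ndarray, tau: int) -> np.ndarray:
    """Normalized t-profile after grouping the ticks into tau consecutive groups."""
    slice_weights = final.sum(axis=(0, 1))
    grouped = slice_weights.reshape(tau, -1).sum(axis=1)
    return grouped / grouped.sum()


# Hypothetical initial tensor whose normalized t-profile is 0.175 : 0.825.
initial = np.zeros((2, 2, 2))
initial[0, 1, 0] = 0.175
initial[1, 0, 1] = 0.825

final = initial
for _ in range(2):                      # k = 3 factors in total
    final = np.kron(final, initial)

slope_initial = entropy_bits([0.175, 0.825])
slope_level1 = entropy_bits(level1_profile(final, tau=2))
```

Both `slope_initial` and `slope_level1` evaluate to approximately 0.669, in line with Theorem 1 at aggregation level 1.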

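Theorem 2 (Weight Power Law (WPL)) If the initial graph $\mathcal{G}_{\mathcal{I}}$ exhibits the WPL at all time
ticks, that is, number of edges E(t) and total weight W(t) over time follow a power law with
exponent $\alpha$, then $\mathcal{G}_{\mathcal{D}}$ shows the same property at time ticks $1, \tau^1, \tau^2, \ldots, \tau^k$ with exactly the same
exponent $\alpha$.

Proof We are given the condition that $\mathcal{G}_{\mathcal{I}}$ follows the WPL at all time ticks, that is,

\[
e_1^{\alpha} = s_1, \quad (e_1 + e_2)^{\alpha} = (s_1 + s_2), \quad \ldots, \quad \Big(\sum_{t=1}^{\tau} e_t\Big)^{\alpha} = \Big(\sum_{t=1}^{\tau} s_t\Big).
\]

After k iterations of RTM, the resulting graph has $E_{\mathcal{I}}^{k} = (\sum_{t=1}^{\tau} e_t)^{k}$ edges and $W_{\mathcal{I}}^{k} = (\sum_{t=1}^{\tau} s_t)^{k}$
total weight. At $t = \tau^k$,

\[
E_{\mathcal{I}}^{\alpha} = W_{\mathcal{I}} \;\Rightarrow\; (E_{\mathcal{I}}^{\alpha})^{k} = W_{\mathcal{I}}^{k} \;\Rightarrow\; (E_{\mathcal{I}}^{k})^{\alpha} = W_{\mathcal{I}}^{k}
\]

At aggregation level 1, $E_{\mathcal{I}}^{k}$ can be written as

\[
E_{\mathcal{I}}^{k} = e_1 * E_{\mathcal{I}}^{k-1} + e_2 * E_{\mathcal{I}}^{k-1} + \cdots + e_\tau * E_{\mathcal{I}}^{k-1}
\]

The same argument holds for $W_{\mathcal{I}}^{k}$. So, at $t = \tau^{k-1}$, the number of edges is $(e_1 * E_{\mathcal{I}}^{k-1})$ and the total weight
is $(s_1 * W_{\mathcal{I}}^{k-1})$. And,

\[
(e_1 * E_{\mathcal{I}}^{k-1})^{\alpha} = (e_1)^{\alpha} * (E_{\mathcal{I}}^{\alpha})^{k-1} = s_1 * W_{\mathcal{I}}^{k-1}
\]

In general, at every aggregation level r, at $t = \tau^{k-r}$, the graph follows the WPL as

\[
(e_1^{r} * E_{\mathcal{I}}^{k-r})^{\alpha} = (e_1^{\alpha})^{r} * W_{\mathcal{I}}^{k-r}
\]

∎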

We observe that when we fit total weight W(t) versus number of edges E(t) at all time
ticks $t \in \{1, 2, \ldots, \tau^k\}$ for the final graph $\mathcal{G}_{\mathcal{D}}$, the resulting exponent remains very close to $\alpha$. In
Fig. 6.3 (b), the user-specified WPL exponent is 1.5, which is equal to the slope when points at
t = {1, 2, 4, 8} are used to fit a line (k = 3). When all points are used, the slope is 1.47.
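
Theorem 2 can be illustrated on per-tick totals alone, because the slice totals of a Kronecker power are the Kronecker power of the slice totals. The sketch below is built on assumptions that are not taken from the experiments: a hypothetical initial graph with 4 and 5 new edges in its two ticks, α = 1.5, and edge counts taken as counts of nonzero cells summed over slices, as in the proof.

```python
import numpy as np

alpha = 1.5
tau, k = 2, 3

# Hypothetical initial graph: edges added per tick, with weights chosen so that
# cumulative weight = (cumulative edges) ** alpha at every tick (WPL).
edges_per_tick = np.array([4.0, 5.0])
weights_per_tick = np.diff(np.cumsum(edges_per_tick) ** alpha, prepend=0.0)


def kron_power(v: np.ndarray, times: int) -> np.ndarray:
    """Kronecker product of a vector with itself, with `times` factors."""
    out = v
    for _ in range(times - 1):
        out = np.kron(out, v)
    return out


E = np.cumsum(kron_power(edges_per_tick, k))     # cumulative edges at ticks 1..tau^k
W = np.cumsum(kron_power(weights_per_tick, k))   # cumulative weight at ticks 1..tau^k

# The WPL holds exactly at ticks 1, tau, tau^2, ..., tau^k.
anchor_ticks = [tau ** r for r in range(k + 1)]
exact_at_anchors = [bool(np.isclose(E[t - 1] ** alpha, W[t - 1])) for t in anchor_ticks]

# Least-squares slope of log W versus log E over all ticks.
slope_all, _ = np.polyfit(np.log(E), np.log(W), 1)
```

In this toy setting the relation is exact at ticks 1, 2, 4 and 8, while the slope fitted over all eight ticks stays close to, but slightly below, α.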

6.1.3 Model Validation and Analysis 

Having created the initial tensor, we take the Recursive Tensor Multiplication of $\mathcal{I}$ by itself k
times. The final tensor $\mathcal{D}$ is an $(N^k \times N^k \times \tau^k)$ tensor spanning $\tau^k$ time ticks. At every time tick
t, we take the t-slice $\mathbf{D}_t$ of $\mathcal{D}$. Next, we simply introduce an edge from node i to node j with
weight $a_{i,j}$ for every nonzero entry of $\mathbf{D}_t$. If node i or node j did not exist, we introduce new
node(s). If both existed, we only increase the edge weight by $a_{i,j}$.

We ran several experiments for different values of N, $\tau$ and k. Our model produced realistic
graphs for a wide range of parameters, the results being independent of the number of iterations
k. In fact, k can be chosen as large as the size of the graph that one needs to generate.

As a comparison with real-world data, we give the plots showing reported laws for BlogNet
in Fig. 6.2. The plots our model generated for N = 10, $\tau$ = 2 and k = 3 are shown in
Fig. 6.3. In particular, we show (a) the Densification Power Law (DPL); (b) the Weight Power
Law (WPL); (c) bursty weight additions; (d) the $\lambda_1$ Power Law (LPL); (e) the $\lambda_{1,w}$ Power Law;
and finally, (f) the Edge Weight Power Law (EWPL). See Table 2.2 and Chapter 3 for details on
these laws. Other desired characteristics such as small and shrinking diameter, the gelling point
and the Degree Power Law for degree distribution of nodes are also matched, but omitted here
for brevity.

Figure 6.2: Plots showing related laws that real-world graphs obey for BlogNet: (a) Densification
Power Law; (b) Weight Power Law; (c) $\Delta W$ entropy; (d) $\lambda_1$ Power Law; (e) $\lambda_{1,w}$ Power Law;
(f) Edge Weight Power Law. First row shows previous laws while second row shows
observations of this work.

Figure 6.3: Plots showing related laws our RTM generator produced, with the same panels (a) to (f)
as Figure 6.2. Notice that they are very similar in all the listed properties to those of BlogNet.


6.2 RTG: Random Typing Graphs


Zipf introduced probably the earliest power law [Zipf, 1932], stating that, in many natural
languages, the rank r and the frequency $f_r$ of vocabulary words follow a power-law $f_r \propto 1/r$.
[Mandelbrot, 1953] argued that Zipf’s law is the result of optimizing the average amount of
information per unit transmission cost. [Miller, 1957] showed that a random process also leads to
Zipf-like power laws. He suggested the following experiment: “A monkey types randomly on a
keyboard with k characters and a space bar. A space is hit with probability q; all other characters
are hit with equal probability, $(1-q)/k$. A space is used to separate words”. The distribution
of the resulting words of this random typing process follows a power-law. Conrad and
Mitzenmacher [Conrad and Mitzenmacher, 2004] showed that this relation still holds when the keys are
hit with unequal probability.

Our RTG model generalizes the above model of natural human behavior, using “random typing”. 
We build the model in three steps, incrementally. In the next two steps, we introduce the base 
version of the proposed model to give an insight. However, as will become clear, it has two 
shortcomings. In particular, the base model does not capture (1) homophily, the tendency to
associate and bond with similar others (people tend to be acquainted with others similar in age,
class, geographical area, etc.), and (2) community structure, the existence of groups of nodes that
are more densely connected internally than with the rest of the graph.


6.2.1 Initial RTG with Independent Equiprobable keys 

As in Miller’s experimental setting, we propose each unique word typed by the monkey to
represent a node in the output graph (one can think of each unique word as the label of the
corresponding node). To form links between nodes, we mark the sequence of words as ‘source’ and
‘destination’, alternatingly. That is, we divide the sequence of words into groups of two and link
the first node to the second node in each pair. If two nodes are already linked, the weight of the
edge is simply increased by 1. Therefore, if W words are typed, the total weight of the output
graph is W/2. See Figure 6.4 for an example illustration. Intuitively, random typing introduces
new nodes to the graph as more words are typed, because the chance of generating longer
words increases with the number of words typed.
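
The word-pairing procedure above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the reference implementation: the function names, the two-key keyboard and q = 0.2 are chosen for the example, and the empty word (a space typed right away) is treated as a valid node label.

```python
import random
from collections import Counter


def random_word(keys: str, q: float, rng: random.Random) -> str:
    """Type keys until the space bar (probability q per keystroke) is hit."""
    chars = []
    while rng.random() >= q:
        chars.append(rng.choice(keys))   # each key has probability (1 - q) / k
    return "".join(chars)


def rtg_ie(num_words: int, keys: str = "ab", q: float = 0.2, seed: int = 0):
    """Return the typed words and a Counter mapping (source, destination) to weight."""
    rng = random.Random(seed)
    words = [random_word(keys, q, rng) for _ in range(num_words)]
    # Consecutive words form (source, destination) pairs; repeated pairs add weight.
    edges = Counter(zip(words[0::2], words[1::2]))
    return words, edges


words, edges = rtg_ie(num_words=1000)
nodes = {w for pair in edges for w in pair}
total_weight = sum(edges.values())
```

With W = 1000 typed words the total edge weight is W/2 = 500, and the node set is exactly the set of distinct words typed.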

Due to its simple structure, this model is very easy to implement and is mathematically 
tractable. Suppose W words are typed on a keyboard with k keys and a space bar, where each key 
is hit with the same probability p and the space bar is hit with probability q = 1 - kp. 

Lemma 2 The expected number of nodes N in the output graph G of the RTG-IE model is 

$$N \propto W^{-\log_p k}.$$

Proof Given the number of words W, we want to find the expected number of nodes N that 
the RTG-IE graph consists of. This question can be reformulated as follows: “Given W words 
typed by a monkey on a keyboard with k keys and a space bar, what is the size of the vocabulary 
V?” The number of unique words V is equal to the number of nodes N in the output graph. 

Figure 6.4: Illustration of the RTG-IE. Upper left: how words are (recursively) generated on a 
keyboard with two equiprobable keys, ‘a’ and ‘b’, and a space bar; lower left: a 
keyboard is used to randomly type words, separated by the space character; upper 
right: how words are organized in pairs to create source and destination nodes in 
the graph over time; lower right: the output graph; each node label corresponds to a 
unique word, while labels on edges denote weights. 

Let w denote a single word generated by the defined random process. Then, w can recursively be 
written as follows: “w : c_i w | S”, where $c_i$ is the character that corresponds to key i, $1 \le i \le k$, 
and S is the space character. So, V as a function of model parameters can be formulated as: 

$$V(W) = V(c_1, Wp) + V(c_2, Wp) + \dots + V(c_k, Wp) + V(S) = k\,V(Wp) + V(S) = k\,V(Wp) + \begin{cases} 1 & \text{with probability } 1-(1-q)^{W} \\ 0 & \text{with probability } (1-q)^{W} \end{cases}$$

where q denotes the probability of hitting the space bar, i.e. q = 1 - kp. Given the fact that W 
is often large, and (1 - q) < 1, it is almost always the case that w = S is generated; but since this 
adds only an additive constant, we can ignore it in the rest of the computation. That is, 

$$V(W) \approx k\,V(Wp) = k\,\bigl(k\,V(Wp^2)\bigr) = \dots = k^{n}\,V(1)$$

where $n = \log_p(1/W) = -\log_p W$. By definition, when W = 1, that is, in case only one word is 
generated, the vocabulary size is 1, i.e. V(1) = 1. Therefore, 

$$V(W) = N \propto k^{n} = k^{-\log_p W} = W^{-\log_p k}.$$
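
As a quick numeric illustration of Lemma 2 (the parameter values below are chosen here only as an example), take k = 2 keys with p = 0.4, so that q = 0.2:

$$-\log_p k = -\frac{\ln 2}{\ln 0.4} \approx 0.756, \qquad N \propto W^{0.756}.$$

Under this scaling, doubling the number of typed words multiplies the expected number of nodes by roughly $2^{0.756} \approx 1.69$ rather than 2. More generally, since $kp < 1$ implies $p < 1/k$, the exponent $-\log_p k$ always lies strictly between 0 and 1, so the number of nodes grows sublinearly in W.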






























Figure 6.5: (a) Rank vs count of vocabulary words typed randomly on a keyboard with k 
equiprobable keys (with probability p) and a space bar (with probability q), follow a 
power law with exponent $\alpha = \log_k p$. Approximately, the area under the curve gives 
the total number of words typed. (b) The relationship between number of edges E 
and total weight W behaves like a power-law (k = 2, p = 0.4). 


The above proof by recursion is in agreement with the early result of [Miller, 1957], 
who showed that in the monkey-typing experiment with k equiprobable keys (with probability 
p) and a space bar (with probability q), the rank-frequency distribution of words follows a power 
law. In particular, 

$$f(r) \propto r^{-1+\log_k(1-q)} = r^{\log_k p}.$$

In this case, the number of ranks corresponds to the number of unique words, that is, the 
vocabulary size V. And, the sum of the counts of occurrences of all words in the vocabulary should 
give W, the number of words typed. The total count can be approximated by the area under the 
curve on the rank-count plot. See Figure 6.5 (a). Next, we give a second proof of Lemma 2 using 
Miller’s result. 

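Proof 2 Let $\alpha = \log_k p$ and let C(r) denote the number of times that the word with rank r is typed. 
Then, $C(r) = c\,r^{\alpha}$, where $C(r)_{\min} = C(V) = c\,V^{\alpha}$ and the constant $c = C(V)\,V^{-\alpha}$. Then we 
can write W as 

$$W = \sum_{r=1}^{V} C(r) = C(V)\,V^{-\alpha} \sum_{r=1}^{V} r^{\alpha} \approx C(V)\,V^{-\alpha} \int_{1}^{V} r^{\alpha}\,dr = C(V)\,V^{-\alpha}\,\frac{V^{\alpha+1} - 1}{\alpha+1} \approx c'\,V^{-\alpha},$$

where $c' = \frac{C(V)}{-(\alpha+1)}$, $\alpha < -1$, and C(V) is very small (usually 1). Therefore, 

$$V = N \propto W^{-1/\alpha} = W^{-\log_p k}.$$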
Lemma 3 The expected number of edges E in the output graph G of the RTG-IE model is 

$$E \approx W^{-\log_p k}\,(1 + c' \log W), \quad \text{for } c' = \frac{q^{-\log_p k}}{-\log p} > 0.$$

Proof Given the number of words W, we want to find the expected number of edges E that the 
RTG-IE graph consists of. The number of edges E is the same as the number of unique pairs of 
words. We can think of a pair of words as a single word e, the generation of which is stopped 
after the second hit to the space bar. So, e always contains a single space character. Recursively, 
“e : c_i e | S w”, where “w : c_i w | S”. So, E can be formulated as: 

$$E(W) = k\,E(Wp) + V(Wq) \tag{6.1}$$

$$V(Wq) = k\,V(Wqp) + \begin{cases} 1 & \text{with probability } 1-(1-q)^{Wq} \\ 0 & \text{with probability } (1-q)^{Wq} \end{cases} \tag{6.2}$$

From Lemma 2, Equation (6.2) can be approximately written as $V(Wq) = (Wq)^{-\log_p k}$. Then, 
Equation (6.1) becomes $E(W) = k\,E(Wp) + c\,W^{\alpha}$, where $c = q^{-\log_p k}$ and $\alpha = -\log_p k$. Given that 
E(W=1) = 1, we can solve the recursion as follows: 


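$$\begin{aligned}
E(W) &\approx k\,\bigl(k\,E(Wp^2) + c\,(Wp)^{\alpha}\bigr) + c\,W^{\alpha} \\
&= k\,\bigl(k\,\bigl(k\,E(Wp^3) + c\,(Wp^2)^{\alpha}\bigr) + c\,(Wp)^{\alpha}\bigr) + c\,W^{\alpha} \\
&= k^{n} E(1) + k^{n-1} c\,(Wp^{n-1})^{\alpha} + k^{n-2} c\,(Wp^{n-2})^{\alpha} + \dots + c\,W^{\alpha} \\
&= k^{n} E(1) + c\,W^{\alpha}\bigl((kp^{\alpha})^{n-1} + (kp^{\alpha})^{n-2} + \dots + 1\bigr)
\end{aligned}$$

where $n = \log_p(1/W) = -\log_p W$. Since $kp^{\alpha} = kp^{-\log_p k} = 1$, 

$$E(W) \approx k^{n} E(1) + n\,c\,W^{\alpha} = k^{-\log_p W} + c\,\frac{\log W}{-\log p}\,W^{-\log_p k} = W^{-\log_p k}\,(1 + c' \log W),$$

where $c' = \frac{c}{-\log p} = \frac{q^{-\log_p k}}{-\log p} > 0$. ∎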


The above function of E in terms of W and other model parameters looks like a power-law for a 
wide range of W. See Figure 6.5 (b). 


Lemma 4 The in(out)-degree $d_n$ of a node in the output graph G of the RTG-IE model is 
power-law related to its total in(out)-weight $W_n$, that is, 

$$W_n \propto d_n^{-\log_k p}$$

with expected exponent $-\log_k p > 1$. 





Proof We will show that $W_n \propto d_n^{-\log_k p}$ for out-edges; a similar argument holds for in-edges. 
Given that the experiment is repeated W times, let $W_n$ denote the number of times a unique word 
is typed as a source. Each such unique word corresponds to a node in the final graph and $W_n$ is 
its out-weight, since the node appears as a source node. Then, the out-degree $d_n$ of a 
node is simply the number of unique words typed as a destination. From Lemma 2, 
$d_n \propto W_n^{-\log_p k}$; inverting this relation and using $1/\log_p k = \log_k p$ gives 

$$W_n \propto d_n^{-\log_k p}, \quad \text{for } -\log_k p > 1.$$

∎

Even though most of the properties listed at the beginning of this section are matched, there 
are two problems with this model: (1) the degree distribution follows a power-law only for 
small degrees and then shows multinomial characteristics (see Figure 6.6 (top)), and (2) it does 
not generate homophily and community structure, because it is possible for every node to get 
connected to every other node, rather than to ‘similar’ nodes in the graph. 


6.2.2 Initial RTG with Independent Un-equiprobable keys 

We can spread the degrees so that nodes with the same-length but otherwise distinct labels would 
have different degrees by making keys have unequal probabilities. This procedure introduces 
smoothing in the distribution of degrees, which remedies the first problem introduced by the 
RTG-IE model. In addition, thanks to [Conrad and Mitzenmacher, 2004], we are still guaranteed 
to obtain the desired power-law characteristics as before (see Figure 6.6 (bottom)). 


6.2.3 Proposed RTG: Model Description 

What the previous model fails to capture is the homophily and community structure. In a real 
network, we would expect nodes to get connected to similar nodes (homophily), and form groups 
and possibly groups within groups (modular structure). In our model, for example on a keyboard 
with two keys ‘a’ and ‘b’, we would like nodes with many ‘a’s in their labels to be connected 
to similar nodes, as opposed to nodes labeled with many ‘b’s. However, in both RTG-IE and 
RTG-IU it is possible for every node to connect to every other node. In fact, this yields a tightly 
connected core of nodes with rather short labels. 

Our proposal to fix this is to envision a two-dimensional keyboard that generates source and 
destination labels in one shot, as shown in Figure 6.7. The previous model generates a word for 
the source and, completely independently, another word for the destination. In the example with two 
keys, we can envision this process as picking one of the nine keys in Figure 6.7 (a), using the 
independence assumption: the probability of each key is the product of the probability of the 
corresponding row and the probability of the corresponding column, which is $p_l$ for letter $l$ and q for 
space (‘S’). After a key is selected, its row character is appended to the source label, and the 
column character to the destination label. This process repeats recursively as in Figure 6.7 (b), 
until the space character has been hit on the first dimension, which terminates the source label, 
and also on the second dimension, which terminates the destination label. 
Figure 6.6: Top row: Results of RTG-IE (k = 5, p = 0.16, W = 1M). The problem with 
this model is that in(out)-degrees form multinomial clusters (left). This is because 
nodes with labels of the same length are expected to have the same degree. This 
can be observed on the rank-frequency plot (right) where we see many words with 
the same frequency. Notice the ‘staircase effect’. Bottom row: Results of RTG-IU 
(k = 5, p = [0.03, 0.05, 0.1, 0.22, 0.30], W = 1M). Unequal probabilities 
introduce smoothing on the frequency of words that are of the same length (right). 
As a result, degree distribution follows a power-law with expected heavy tails (left). 


In order to model homophily and communities, rather than assigning cross-product probabilities 
to keys on the 2-d keyboard, we introduce an imbalance factor β, which will decrease the chance 
of a-to-b edges, and increase the chance for a-to-a and b-to-b edges, as shown in Figure 6.7 
(c). Thus, for the example that we have, the formulas for the probabilities of the nine keys 
become: 

$$\begin{aligned}
prob(a, b) &= prob(b, a) = p_a\,p_b\,\beta, & prob(a, a) &= p_a - \bigl(prob(a, b) + prob(a, S)\bigr), \\
prob(S, a) &= prob(a, S) = q\,p_a\,\beta, & prob(b, b) &= p_b - \bigl(prob(b, a) + prob(b, S)\bigr), \\
prob(S, b) &= prob(b, S) = q\,p_b\,\beta, & prob(S, S) &= q - \bigl(prob(S, a) + prob(S, b)\bigr).
\end{aligned}$$

By boosting the probabilities of the diagonal keys and down-rating the probabilities of the off- 
diagonal keys, we are guaranteed that nodes with similar labels will have higher chance to get 
connected. The pseudo-code of the generator is given in Algorithm 1. 
Figure 6.7: The RTG model: random typing on a 2-d keyboard, generating edges 
(source-destination pairs). See Algorithm 1. Panels: (a) first level, (b) recursion, 
(c) communities. (a) An example 2-d keyboard (nine keys); hitting a key generates the 
row (column) character for the source (destination); shaded keys terminate source and/or 
destination words. (b) Illustrates the recursive nature. (c) The imbalance factor β 
favors diagonal keys and leads to homophily. 


Algorithm 1: RTG Model 

Input: number of keys k, probability q to hit ‘space’, iterations W, imbalance factor β 
Output: edge-list L for output graph G 

Initialize (k + 1)-by-(k + 1) matrix P with cross-product probabilities 
// in order to ensure homophily and community structure 
Multiply off-diagonal probabilities by β, 0 < β < 1 
Boost diagonal probabilities s.t. the sum of row (column) probabilities remains the same 
Initialize edge list L 
for i = 1 to W do 
    L1, L2 ← SelectNodeLabels(P, k) 
    Append (L1, L2) to L 

Next, we describe how we handle time so that edge/weight additions are bursty and self-similar. 
We also discuss the generalizations of the model in order to produce all types of uni/bipartite, 
(un)weighted, and (un)directed graphs. 


Burstiness and Self-similarity 

Most real-world traffic as well as edge/weight additions to real-world graphs have been found 
to be self-similar and bursty [Crovella and Bestavros, 1996; Gomez and Santonja, 1998; Gribble 
et al., 1998]. Therefore, in this section we give a brief overview of how to aggregate time so that 
edge and weight additions, that is, ΔE and ΔW, are bursty and self-similar. 

Procedure SelectNodeLabels(P, k) 

Input: probability matrix P, number of keys k 
Output: source label L1 and destination label L2 

Initialize L1 and L2 to the empty string 
while L1 or L2 is not terminated do 
    Draw key (i, j) with probability P(i, j) 
    if i ≤ k and j ≤ k then 
        Append character ‘i’ to L1 and ‘j’ to L2, each only if not terminated 
    else if i ≤ k and j = k + 1 then 
        Append character ‘i’ to L1 if not terminated 
        Terminate L2 
    else if i = k + 1 and j ≤ k then 
        Append character ‘j’ to L2 if not terminated 
        Terminate L1 
    else 
        Terminate L1 and L2 
return L1 and L2 

Notice that when we link two nodes at each step, we add 1 to the total weight W. So, if every 
step is represented as a single time-tick, the weight additions are uniform. However, to generate 
bursty traffic, we need to have a bias factor b > 0.5, such that a b-fraction of the additions happen 
in one half and the remaining in the other half. We use the b-model [Wang et al., 2002], which 
generates such self-similar and bursty traffic. Specifically, starting with a uniform interval, we 
recursively subdivide weight additions to each half, quarter, and so on, according to the bias b. 
To create randomness, at each step we randomly swap the order of fractions b and (1 - b). 
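
The recursive subdivision described above can be sketched as follows. This is a minimal illustration of the b-model idea, assuming a total weight that is split over 2^levels time ticks; the function name `b_model` and its parameters are hypothetical and not taken from [Wang et al., 2002].

```python
import random

def b_model(total, levels, b=0.7, seed=0):
    """Split `total` over 2**levels time ticks with bias b at every subdivision."""
    rng = random.Random(seed)
    series = [float(total)]
    for _ in range(levels):
        refined = []
        for volume in series:
            # Randomly decide which half receives the b-fraction.
            left = b if rng.random() < 0.5 else 1.0 - b
            refined.extend([volume * left, volume * (1.0 - left)])
        series = refined
    return series

ticks = b_model(total=1024, levels=5, b=0.7)
print(len(ticks), "ticks; busiest tick carries", max(ticks))
```

With b = 0.5 every tick receives the same amount (uniform additions); the closer b is to 1, the more the additions concentrate in a few ticks.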

Generalizations 

We can easily generalize RTG to model all types of graphs. To generate undirected graphs, we 
can simply treat edges from source to destination as undirected, since the formation of source 
and destination labels is the same and symmetric. For unweighted graphs, we can simply ignore 
duplicate edges, that is, edges that connect already linked nodes. Finally, for bipartite graphs, 
we can use two different sets of keys such that on the 2-d keyboard, the source dimension contains 
keys from the first set, and the destination dimension keys from the other set. This ensures that 
source and destination labels are completely different, as desired. 

6.2.4 Model Validation and Analysis 

The question we wish to answer here is how well RTG is able to model real-world graphs. The 
datasets we used are: 







BlogNet: a social network of blogs based on citations (undirected, unipartite and unweighted 
with N=27,726; E=126,227; over 80 time ticks). 

CampOrg: the U.S. electoral campaign donations network from organizations to candidates 
(directed, bipartite and weighted with N=23,191; E=877,721; and W=4,383,105,580 over 29 
time ticks). Weights on edges indicate donated dollar amounts. 

In order to evaluate community structure, we use the modularity measure in [Newman and 
Girvan, 2004]. Figure 6.8 (left) shows that modularity increases with smaller imbalance factor β. 
Without any imbalance, β = 1, modularity is as low as 0.35, which indicates that no significant 
modularity exists. In Figure 6.8 (right), we also show the running time of RTG with respect to 
the number of duplicate edges (that is, number of iterations W). Notice the linear growth with 
increasing W. 
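
For reference, the modularity of a given partition sums, over communities, the fraction of edges that fall inside the community minus the squared fraction of edge ends attached to it. The sketch below computes this quantity for an unweighted, undirected edge list; the function name and the toy graph are illustrative assumptions, and how the partition itself is obtained for the RTG graphs is not specified here.

```python
from collections import defaultdict

def modularity(edges, community):
    """Q = sum_c (e_cc - a_c^2) for an undirected, unweighted edge list."""
    m = len(edges)
    inside = defaultdict(int)   # edges with both ends in community c
    ends = defaultdict(int)     # edge ends attached to community c
    for u, v in edges:
        cu, cv = community[u], community[v]
        ends[cu] += 1
        ends[cv] += 1
        if cu == cv:
            inside[cu] += 1
    return sum(inside[c] / m - (ends[c] / (2 * m)) ** 2 for c in ends)

# Toy example: two triangles joined by a single bridge edge.
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
one_group = {node: "A" for node in range(6)}
print(modularity(toy_edges, two_groups), modularity(toy_edges, one_group))
```

Placing all nodes in a single community always yields a score of 0, which is the baseline against which a partition is judged.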




Figure 6.8: (left) Modularity score vs. imbalance factor β; modularity increases with 
decreasing β. For β = 1, the score is very low, indicating no significant modularity. (right) 
Computation time vs. number of iterations W; time grows linearly with W. 


In Figures 6.9 and 6.10, we show the related patterns for BlogNet and CampOrg as well as 
synthetic results, respectively. In order to model these networks, we ran experiments for different 
parameter values k, q, W, and β. Here, we show the closest results that RTG generated, though 
fitting the parameters is a challenging future direction. We observe that RTG is able to match the 
long wish-list of static and dynamic properties we presented earlier for the two real graphs. 
[Figure 6.9 panels, shown once for BlogNet and once for RTG: (a) diameter, (b) components, 
(c) degrees, (d) TPL, (e) DPL, (f) entropy ΔE, (g) λ1PL, (h) EPL] 

Figure 6.9: Top two rows: properties of BlogNet: (a) small and shrinking diameter; (b) largest 3 
connected components; (c) degree distribution; (d) triangles Δ vs number of nodes 
with Δ triangles; (e) densification; (f) bursty edge additions; (g) largest 3 eigenvalues 
w.r.t. E; (h) rank spectrum of the adjacency matrix. Bottom two rows: results of RTG. 
Notice the similar qualitative behavior for all eight laws. See Table 2.2 and Chapter 3 
for details on these laws. 
[Figure 6.10 panels, shown once for CampOrg and once for RTG: (a) diameter, (b) components, 
(c) degree distr., (d) SPL, (e) D(W)PL, (f) entropy ΔW(E), (g) λ1PL, (h) EPL] 

Figure 6.10: Top two rows: properties of CampOrg; as opposed to BlogNet, CampOrg is 
weighted. So, different from above we show: (d) node weight vs in(inset: 
out)degree; (e) total weight vs number of edges (inset); (f) bursty weight 
additions (inset). Bottom two rows: results of RTG. Notice the similar qualitative 
behavior for all nine laws. See Table 2.2 and Chapter 3 for details on these laws. 
6.3 Summary of contributions 


In this chapter, we focused on the problem of how to build a generative model that could 
produce realistic-looking synthetic networks. We presented two graph generators, namely RTM and 
RTG¹, that mimic topological properties of real-world networks. Compared to previous work on 
graph generators, these models are two of the first models that capture the dynamic and weighted 
properties, in addition to static unweighted properties, that real networks exhibit. 

Our first model, the Recursive Tensor Model (RTM), is a simple, recursive generator based on 
Kronecker tensor multiplication. We rigorously proved that RTM produces several desired 
characteristics, such as bursty weight additions. We also experimentally validated that it mimics a long 
list of the laws described in previous chapters. On the other hand, RTM has some shortcomings. 
In particular, it requires a fixed, predetermined number of nodes as initial input. The number 
of nodes in the final output graph grows as powers of this initial number over iterations due 
to the nature of the Kronecker product. In other words, new nodes joining the network do not 
emerge naturally. A second issue with RTM is that it generates multinomial/lognormal (instead 
of power-law) distributions. 

Our second model, Random Typing Graphs (RTG), is based on a simple ‘random typing’ 
procedure which is shown to mimic natural behavior very well. The random typing involves hitting 
keys on a keyboard randomly and generating words, i.e. node labels, separated by space 
characters. This model overcomes the issues associated with RTM as discussed above. In particular, 
setting the probabilities of keys being hit unequally ensures that it generates power-laws. In 
addition, the nodes arrive to the network naturally over time: longer typing generates new words 
seamlessly. More importantly, it meets all five desirable properties given in the introduction. 
Particularly, the RTG model is 

1. realistic; generating graphs that obey all eleven properties that real graphs obey (no other 
generator has been shown to achieve that), 

2. simple and intuitive; yet it generates the emergent, macroscopic patterns that we see in 
real-world graphs, 

3. parsimonious; requiring only a handful of parameters, 

4. flexible; capable of generating weighted/unweighted, directed/undirected, and 
unipartite/bipartite graphs, and any combination of the above, and 

5. fast; being linear in the number of iterations (on par with the number of duplicate edges 
in the output graph). 


¹ Source code of our algorithm can be found at www.cs.cmu.edu/~lakoglu/#code 

Chapter 7 


Models of human communications 


PROBLEM STATEMENT: How could we design a realistic and intuitive generative model that 
will naturally reproduce real human-to-human communication behavior? 


In this chapter, we introduce a novel generator to model the way in which humans decide when 
and whom to contact. We argue that such a model should be utility-driven, as opposed to earlier 
models (preferential attachment [Barabasi and Albert, 1999], forest-fire [Leskovec et al., 2005b], 
butterfly [Mcglohon et al., 2008], etc.) which are mainly randomness-guided generators. Our 
guiding principle is that humans balance a trade-off between the cost of the communication (in 
time and money), and its benefit (in valuable information and emotional support). 


7.1 PaC: Pay and Call Model 


Every communication, such as a phone-call, SMS, or e-mail, has a cost in terms of money, time, 
and equipment. On the other hand, it has a benefit, otherwise humans would not do it. The 
benefits can be psychological and emotional (talking to friends makes us happy), or monetary 
(stock tip), or desirable in other ways. For ease of presentation, we refer to the benefit as if it is 
measured in emotional dollars. The point of this thought experiment is to set up a utility-driven 
model for the social contacts of humans, which should be more realistic and more informative 
than the ones using randomness. 

Therefore, we assume that people are rational agents, which means telemarketers are excluded 
from the model, and we design our generator to guide the behavior of each agent according to 
a well-defined utility function. Ideally, the fundamental macro-phenomena of a social network 
should then emerge from the simple local behavior of each agent/human. 
7.1.1 Model Description 


Following the above discussion, we now present our utility-driven model PaC as a Pay and Call 
game. Assume a setting where a set A of n distinct agents create links to one another through 
phone calls. In every round of the game, each agent’s strategy is to choose among other peers to 
whom he will make calls and build links. Links are undirected. Once agent $a_i \in A$ calls $a_j \in A$, 
there will be a link between them. The total number of phone-calls that $a_i$ and $a_j$ make to each 
other is treated as the weight on the undirected link between them. The PaC model essentially 
includes the following four ingredients: 

• It adopts the agent-based modeling approach. Each agent has a friendliness value, an 
exponential lifetime, a certain amount of capital, and the expected payoffs from talking to 
strangers. 

• The goal of each agent is to invest his limited capital into phone-calls and maximize the 
potential payoffs from each conversation. 

• The per-minute gain of a conversation gradually saturates, and finally both the 
caller and the callee lose interest and stop the conversation. 

• Each agent $a_i$ can ask his partners for recommendations. Every partner recommends the 
profitable agents from his own partners, so $a_i$ benefits from talking to the most profitable 
agent within the recommendations. 


Friendliness and Exponential Lifetime 

Each agent has a friendliness value $F_i \in (0,1)$ to show his personality. $F_i$ approaching 1 
means the agent is very open and friendly, and $F_i$ close to 0 means he is very shy and introverted. 
Agent $a_i$ has a probability $P_i$, uniformly chosen from 0 to 1, to stay in the game, and has the probability 
$1 - P_i$ to leave the game, so that we can simulate the mixture of different ages in a real scenario. 
Once an old agent leaves, all his links will be removed, and a new agent replaces his position 
with the friendliness and $P_i$ initialized to new values. 


Utility-Driven Phone-Calls and Saturation 

An agent’s payoffs are the difference between the benefits and costs. The benefits are defined 
based on the following considerations. Two open agents usually can benefit emotionally from 
a happy conversation. When an open agent meets a shy agent, they may benefit less from their 
conversation. Finally, two shy agents might gain little in the end. In addition, after two agents 
have been talking for a while, they may gradually lose interest, and gain less emotionally as time 
goes by. Agents $a_i$ and $a_j$ can achieve $\sqrt{F_i \times F_j} \times a^{m-1}$ emotional dollars per minute 
from a conversation, where $a \in (0,1)$ is called the saturation factor and represents the loss of 
interest, and m is the number of minutes for which they have been talking. 

For an m-minute-long conversation, the total benefits are defined as 

$$\text{benefits} = \sqrt{F_i \times F_j} \times (1 + a + a^2 + \dots + a^{m-1}) = \sqrt{F_i \times F_j} \times \frac{1 - a^{m}}{1 - a} \tag{7.1}$$

The costs are the expenses of phone-calls, which include $C_{ini}$ and $C_{pm}$. $C_{ini}$ is the cost to 
initiate a phone-call, and $C_{pm}$ is the per-minute fee. The total costs for an m-minute call will be 
$C_{ini} + m \times C_{pm}$, so our utility function is defined as 

$$\text{payoffs} = \text{benefits} - C_{ini} - m \times C_{pm} \tag{7.2}$$

and each agent starts and maintains a conversation until the payoffs as given by Equation 7.2 
reach the maximum value or the agent has used all his money. 
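
Equations 7.1 and 7.2 can be evaluated directly, as in the hedged sketch below. The function names are hypothetical, and two modeling choices are assumptions made here: the call length is found by adding minutes while the next minute's gain exceeds the per-minute fee (valid because the per-minute gain is decreasing in m), and "has used all his money" is read as the total cost not exceeding the caller's capital.

```python
import math

def benefits(f_i, f_j, a, m):
    """Equation 7.1: total emotional dollars from an m-minute conversation."""
    return math.sqrt(f_i * f_j) * (1 - a ** m) / (1 - a)

def payoffs(f_i, f_j, a, m, c_ini, c_pm):
    """Equation 7.2: benefits minus the initiation cost and per-minute fees."""
    return benefits(f_i, f_j, a, m) - c_ini - m * c_pm

def best_call_length(f_i, f_j, a, c_ini, c_pm, capital):
    """Number of minutes after which an extra minute no longer pays off."""
    m = 0
    while True:
        next_minute_gain = math.sqrt(f_i * f_j) * a ** m   # gain of minute m + 1
        cost_if_extended = c_ini + (m + 1) * c_pm
        # Assumption: the caller cannot spend more than his capital on one call.
        if next_minute_gain <= c_pm or cost_if_extended > capital:
            return m
        m += 1

# Two maximally friendly agents with a = 0.9, C_ini = 0.1, C_pm = 0.4.
m_rich = best_call_length(1.0, 1.0, 0.9, 0.1, 0.4, capital=100.0)
m_poor = best_call_length(1.0, 1.0, 0.9, 0.1, 0.4, capital=1.0)
print(m_rich, payoffs(1.0, 1.0, 0.9, m_rich, 0.1, 0.4), m_poor)
```

A call length of 0 means that, under these assumptions, not even the first minute is worth its cost or affordable.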


Expected Payoffs on Strangers 

At first, each agent is given an initial capital which is enough to make one call only. Since none 
of the agents have ever talked before, agent $a_i$ first calls a stranger $a_j$ chosen uniformly at random, and keeps the 
conversation going until either the payoffs by Equation 7.2 begin to decrease or s/he spends all his 
money in the call ($a_j$.capital < 0). When the call is finished, $a_i$ and $a_j$ will achieve the payoffs 
$P_j$ from the conversation. A link is built between $a_i$ and $a_j$ with weight 1, and $a_i$ will remember 
the payoffs $P_j$ earned by talking to $a_j$. Because $a_j$ was a stranger to $a_i$ before they met, $a_i$ 
also updates his expected payoffs from talking to strangers as 

$$S_{exp} = \frac{\sum_{j=1}^{S} P_j}{S} \tag{7.3}$$

where S is the total number of times talking to strangers, and $P_j$ is the payoffs achieved at each 
time. $S_{exp}$ is initialized to 0 in the beginning. In each round of the game, agent $a_i$ is allowed 
to call $a_j$ only once. If $a_i$ still has some money left (note that the payoffs earned in the current 
round can only be used in the next round), he will continue to interact with other strangers. 


Recommendations 

Once agent $a_i$ has some partners, he will first prioritize his partners according to the remembered 
payoffs, and talk to them in that order. If the payoffs of the currently chosen partner are less than 
$a_i$’s expected payoffs from strangers ($S_{exp}$ > 0), $a_i$ will stop talking to partners and choose to 
call strangers again. He first asks his partners for recommendations. Every partner will tell $a_i$ 
how much money he actually earned by talking to his own partners last time; $a_i$ can then pick 
the most profitable agent out of the partners of his partners. If all the recommended agents are 
already his partners, $a_i$ will uniformly choose a stranger from the rest. 
In summary, the PaC model is formulated as pseudo-code in Algorithm 2. 

Procedure Talk2Strangers($a_i$) 

Input: current agent $a_i$ 

total ← 0 
if N($a_i$) ≠ ∅ and $a_i$ finds the most profitable agent he never talked to from the recommendations then 
    $a_j$ ← the most profitable agent 
else 
    $a_j$ ← GetRandom(A) 
while $a_i$.capital > $C_{ini}$ + $C_{pm}$ and $a_i$.$S_{exp}$ > 0 do 
    maximize payoffs with constraint $a_i$.capital > 0 by Equation 7.2 
    add $a_j$ to N($a_i$), add $a_i$ to N($a_j$) 
    $a_j$.capital ← $a_j$.capital + payoffs 
    total ← total + payoffs 
    update $a_i$.$S_{exp}$ and $a_j$.$S_{exp}$ by Equation 7.3 
$a_i$.capital ← $a_i$.capital + total 

Procedure Talk2Partners($a_i$) 

Input: current agent $a_i$ 

total ← 0 
prioritize N($a_i$) according to the descending order of the remembered payoffs 
for k ← 1 to |N($a_i$)| do 
    $P_k$ ← $a_i$’s remembered payoffs from $a_k$ 
    if $P_k$ > $a_i$.$S_{exp}$ then 
        maximize payoffs with constraint $a_i$.capital > 0 by Equation 7.2 
        increase the weight on the link between $a_i$ and $a_k$ by 1 
        $a_k$.capital ← $a_k$.capital + payoffs 
        total ← total + payoffs 
    else 
        Talk2Strangers($a_i$) 
        break 
    if $a_i$.capital < 0 then break 
$a_i$.capital ← $a_i$.capital + total 







7.1.2 Model Validation and Analysis 


Our goal here is to show that our model is able to generate degree, weight and clique distributions 
that mimic a real graph like our communication networks. Notice that we only want to show 
qualitative match of the properties. Exact fitting is outside the scope of this work. We decided 
to test our model with respect to all the usual patterns, and specifically the degree distribution, 
weight distribution, as well as the snapshot power law. We also want to qualitatively check 
against our new clique-related patterns, the CDPL, CPL, and the TWL. We simulated the model 
35 times for 100,000 nodes, with $C_{ini}$ = 0.1, $C_{pm}$ = 0.4 and a = 0.9. For each agent $a_i$, $F_i$ and 
$P_i$ are randomly chosen from 0 to 1. 

Figure 7.1 shows the results of these checkpoints. The top rows are for the actual graph Q p - 1, and 
the bottom rows are for a synthetic graph, generated by our PaC model. In all cases, notice that 
PaC gives skewed distributions that are remarkably close to the real ones. For example, except 
for the giant connected component which is an isolated point distant from the rest, the size 
distribution of the connected-components conforms to a power-law. The exponents take values 
within the range observed in real world networks with a least-square fit of $R^2$ > 0.95. 

From earlier research [Mitzenmacher, 2003; Reed and Jorgensen, 2004], we understand how 
heavy-tailed distributions such as power-law, lognormal and DPLN could arise for the degree 
distribution and the node weight distribution. According to [Mitzenmacher, 2003], lognormal 
distributions can be naturally generated by multiplicative processes: For a biological example, at 
each step j, an organism may grow or shrink by a certain percentage according to a random 
variable $F_j$. If $X_j$ denotes the current size of the organism, $X_j = F_j X_{j-1}$ where $F_j$ is independent 
of i. Consider $\ln X_j = \ln X_0 + \sum_{k=1}^{j} \ln F_k$. If $F_k$, $1 \le k \le j$, are independent 
lognormal distributions, then $X_j$ is always lognormal. If $F_k$ are not lognormal, but are independent and 
identically distributed with finite mean and variance, by the Central Limit Theorem, $\sum_{k=1}^{j} \ln F_k$ 
converges to a normal distribution, and $X_j$ will asymptotically approach a lognormal distribution. If 
$X_j$ is lower bounded by a minimum value, then the distribution will become a power-law. If we 
sample the series from $X_0$ to $X_j$ by a geometrically distributed random time k, we will have a 
geometric mixture of lognormal distributions. This will turn out to be a DPLN distribution with 
two power-laws at both tails. 
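
The multiplicative argument above can be checked empirically with a small simulation. The sketch below is illustrative only: the growth factors are drawn uniformly from [0.5, 1.5] (an arbitrary non-lognormal choice made here), and skewness is used as a rough symmetry check, so the raw sizes should look heavily skewed while their logarithms look close to symmetric, as expected of an approximately lognormal variable.

```python
import math
import random
import statistics

def multiplicative_sizes(n_samples, steps, seed=0):
    """Simulate X_j = F_j * X_{j-1} with X_0 = 1 and i.i.d. uniform factors."""
    rng = random.Random(seed)
    sizes = []
    for _ in range(n_samples):
        x = 1.0
        for _ in range(steps):
            x *= rng.uniform(0.5, 1.5)   # assumed growth-factor distribution
        sizes.append(x)
    return sizes

def skewness(values):
    """Sample skewness: third standardized moment."""
    mu = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mu) / sd) ** 3 for v in values) / len(values)

sizes = multiplicative_sizes(n_samples=2000, steps=50)
log_sizes = [math.log(x) for x in sizes]
print("skewness of sizes:", skewness(sizes))
print("skewness of log-sizes:", skewness(log_sizes))
```

A near-zero skewness of the log-sizes is consistent with, but does not prove, normality of ln X_j; a histogram or a normality test would be the natural next check.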

Following the PowerTrack method in [Reed and Jorgensen, 2004], we empirically analyze the 
generative process of our PaC model by taking two snapshots $G_{T_i}$ and $G_{T_j}$ at time steps $T_i$ and $T_j$ 
with j - i > 1. Among the common agents between $G_{T_i}$ and $G_{T_j}$, we calculate the ratio $X_{T_j} / X_{T_i}$, 
where $X_T$ represents either the degree or the weight for each node. In Figure 7.2, the 
distributions of the ratio for both the degree and the weight appear to be parabolic in logarithmic scales. 
This provides good evidence that a lognormal multiplicative process similar to that described 
in [Mitzenmacher, 2003] is involved in the temporal evolution of our model. 

Another important issue is that we also need to test the independence between partners and their 
ratios, and the same for the calls. Here, the correlation coefficients (small values are necessary but not 
sufficient for independence [Reed and Jorgensen, 2004]) are very small: -0.02 and -0.04 for 
partners and calls respectively. Finally, in each round of the game, every agent has the 
probability $P_i$ to stay in the game and $1 - P_i$ to leave it, which essentially leads to a geometric lifetime. Therefore, 
although we do not explicitly assume any prior distribution about the ratio (the File Model in 
[Mitzenmacher, 2003] explicitly assumes a lognormal distribution for $F_j$), the PaC model can 
still mimic the DPLN degree distribution and the node weight distribution, which qualitatively match 
those of the real communication networks. 

Figure 7.1: Qualitative comparison between the real graph (top two rows) and our synthetic 
graph (bottom two rows). PaC gives skewed distributions like the real ones. 

Figure 7.2: The ratio of partners (left), and calls (right) between two different snapshots of PaC 
follow the lognormal distribution. The parabolic line is fitted in red. 

Comparing with existing graph generators, we see that many models, such as the
preferential-attachment guided models, ignore weight information and generate only the giant
connected component. In contrast, our model reproduces not only the patterns that hold in
unweighted networks, but also the patterns followed by weighted communication networks.


7.2 Summary of contributions 

Many preferential-attachment [Barabasi and Albert, 1999] guided models assume that a newly- 
added node is more likely to be linked to the most popular node of the current graph. However, in 
real world scenarios, incoming nodes are typically unaware of such global structural knowledge 
of the network. Moreover, most earlier generators dictate that nodes will choose contacts at 
random; in contrast, we argue that people choose contacts to maximize some utility. 

In this work we designed an intuitive graph generator, called PaC, for modeling human
communication behavior. In this model, each node (a) uses only local information, and (b) uses no
randomness, but instead tries to maximize a well-defined utility function. The major advantage
of PaC over older generators is that it can answer what-if scenarios. For example, if the
connection price of each phone call goes up, will this decrease the average number of friendship
edges? What about a change in the price-per-minute? What if there is a flat rate? Based on
the utility function of PaC, we can explore what the impact of these settings would be on the
structure and evolution of the network.












Part III 

Anomaly Detection 
Chapter 8


Preliminaries 

8.1 Introduction 


Given a large real-world graph, possibly with weighted edges and attributed nodes, which nodes 
should we consider as “strange”? What are the time points at which the graph structure changes
significantly? Applications of such settings abound: For example, in network intrusion detection, 
we have computers sending packets to each other, and we want to know which nodes misbehave 
(e.g., spammers, port-scanners). In a who-calls-whom network [Seshadri et al., 2008], strange 
behavior may indicate defecting customers, or telemarketers, or even faulty equipment dropping 
connections too often. In a social network, like Facebook and LinkedIn, again we want to spot
users whose behavior deviates from the usual behavior [Leskovec et al., 2008], such as people 
adding friends indiscriminately, in “popularity contests”. 

The list of applications continues: Anomalous behavior could signify irregularities, like credit 
card fraud, calling card fraud, campaign donation irregularities, accounting inefficiencies or fraud 
[Bay et al., 2006], extremely cross-disciplinary authors in an author-paper graph [Sun et al., 
2005], suspicious cargo shipments [Eberle and Holder, 2006], electronic auction fraud [Chau 
et al., 2006; Pandit et al., 2007], and many others. In addition to revealing suspicious, illegal 
and/or dangerous behavior, anomaly detection is useful for spotting rare events, as well as for 
the thankless, but absolutely vital task of data cleansing [Chen et al., 2005; Dasu and Johnson, 
2003]. Moreover, anomaly detection is intimately related to pattern and law discovery:
only if the majority of our nodes closely obey a pattern (say, a power law) can we
confidently consider as outliers the few nodes that deviate [Broder et al., 2000].

Most existing anomaly and outlier detection algorithms focus on clouds of multi-dimensional 
points, and similarly most algorithms for change and event detection are designed for traditional 
time series data. Our main contribution in this part of the thesis is to develop such methods for 
graph data. We tackle the above problem in various settings: for graphs possibly with weighted
edges and attributed nodes, and for graphs changing over time. Furthermore, we make our detection
techniques accessible to human analysts, by providing means for sensemaking.
In Chapter 9, we focus on the problem of spotting strange nodes in plain graphs with weighted
edges. We refer to a graph for which we only know its structure, that is, the nodes and the links
among them, as a plain graph. We start by answering questions such as: What features should
we extract to characterize each node? What patterns and laws do the nodes obey? Based on the
patterns we observe, we develop OddBall, a scalable, unsupervised method for anomalous
node detection. The main idea is to focus on the local neighborhood (the egonet), that is, a
sphere, or a ball (hence the name OddBall) around each node. We show that egonets obey
some surprising patterns, which gives us confidence to declare as outliers the ones that deviate.
We apply OddBall¹ to numerous real graphs (DBLP, political donations, etc.) and we show
that it indeed spots nodes that a human would agree are strange and/or extreme.

In Chapters 10 and 11, we address the problem of finding clusters and outliers in attributed graphs.
We refer to a graph in which the nodes are associated with a set of attributes (or features) as an
attributed graph. We consider binary and categorical attributes in those two chapters,
respectively. First we introduce PICS¹, to find cohesive groups of nodes which have similar
connectivity patterns as well as attribute coherence. Second we introduce CompreX, for
identifying anomalies using pattern-based compression. The main idea behind this work is to build
a compression model that describes the norm of the data succinctly, and subsequently flag those
points that are dissimilar to the norm, that is, those with high compression cost, as anomalies.

In Chapter 12, we look at the problem of detecting change-points in a time-varying graph at 
which many nodes deviate from their normal ‘behavior’. In a nutshell, the main idea of our
method is as follows. We first extract time sequences of several network features for all nodes 
in the graph. Then, we derive a ‘behavior’ vector for the nodes over consecutive time windows 
and compare each one to a summary of previous ‘behavior’ vectors. We flag a time window 
as anomalous and report that an event has occurred if its ‘behavior’ is found to be significantly
different from its past.

In Chapter 13, we develop a novel method for sensemaking for graph anomalies. Methods
such as OddBall, PICS, and CompreX, which we introduce in earlier chapters, help us find anomalous
nodes in a given graph. Our new method, called Dot2Dot¹, helps us understand how these
nodes are correlated within the graph and thus to further summarize the anomalies. In particular,
it groups the flagged anomalous nodes into groups where the nodes in the same group are ‘close’ 
to each other in the graph while nodes across groups are ‘sufficiently’ far apart. For the ‘close-by’ 
nodes in the same group, Dot2Dot finds a small candidate graph around them for visualization 
and it highlights good connection pathways among them. The main idea is to build a compression 
framework and employ information-theoretic ideas to select the best set of connection subgraphs 
that would yield the minimum compression cost for the given subset of nodes in the graph. 

All the proposed methods are designed to work in an unsupervised fashion. That is, they do not
assume that there exist any labeled anomalous data to learn from. Furthermore, the algorithms
are developed so that their running time scales linearly with respect to the input graph size, and
so that they operate parameter-free, i.e., they require no user-specified parameters (similarity
thresholds, number of clusters, etc.).

¹ Source code of OddBall, PICS, and Dot2Dot is available at www.cs.cmu.edu/~lakoglu/#code
8.2 Definitions


In this section, we introduce basic terminology and present the Minimum Description Length 
(MDL) principle. MDL serves as the main model selection principle we use for our methods in 
this part of the thesis. 


Egonets of a Graph 

Borrowing terminology from social network analysis (SNA), an “ego” is an individual node. A 
graph has as many egos as it has nodes. Egos can be people, groups, organizations, or whole 
societies, depending on what the nodes of the graph represent. 

The “k-step neighborhood” of node i is the collection of node i, all nodes at most k steps away
from it, and all the connections among all of these nodes; formally, this is the “induced subgraph”.
In SNA, the 1-step neighborhood of a node is specifically known as its “egonet”. It contains the
“ego” (center of interest) and all the nodes (neighbors) and edges directly around the ego. Formally,

Definition 5 (k-step neighborhood) Given a graph $\mathcal{G}$, a node $i$ in $\mathcal{G}$, and a non-negative integer $k$,
let $\mathcal{N}_i^{(k)} = \{ j \in V(\mathcal{G}) : d(i,j) \le k \}$ denote the nodes at most $k$ steps away from node $i$, where $d(i,j)$
is defined as the minimum path length from node $i$ to node $j$ in $\mathcal{G}$. Then, the k-step neighborhood
of node $i$, denoted as $\mathcal{G}_i^{(k)}(V, E)$, is defined as the induced subgraph, where
$V(\mathcal{G}_i^{(k)}) = \mathcal{N}_i^{(k)}$ and $E(\mathcal{G}_i^{(k)}) = \{ (u,v) \in E(\mathcal{G}) : u, v \in \mathcal{N}_i^{(k)} \}$.
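
To make the definition concrete, the following minimal sketch extracts the k-step neighborhood of a node with a breadth-first search that stops after k hops. It assumes an undirected graph stored as a dictionary from each node to the set of its neighbors; the function name and this representation are illustrative choices of ours, not the data structures used in the thesis code.

```python
from collections import deque


def k_step_neighborhood(adj, source, k):
    """Return (nodes, edges) of the subgraph induced by nodes within k hops.

    adj maps each node to the set of its neighbors (undirected graph).
    """
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == k:
            continue  # do not expand beyond k steps
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    nodes = set(dist)
    # Induced subgraph: keep every edge whose two endpoints are both in `nodes`.
    edges = {frozenset((u, v)) for u in nodes for v in adj.get(u, ()) if v in nodes}
    return nodes, edges


if __name__ == "__main__":
    toy = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}}
    egonet_nodes, egonet_edges = k_step_neighborhood(toy, source=1, k=1)
    print(sorted(egonet_nodes), len(egonet_edges))
```

With k = 1 the call returns the egonet of the source node: the ego, its direct neighbors, and every edge among them, including edges between two neighbors.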


Minimum Description Length principle 

The Minimum Description Length (MDL) principle [Grünwald, 2007] is a practical version of
Kolmogorov Complexity [Li and Vitányi, 1993], and can be regarded as a model selection
criterion based on lossless compression principles. Both embrace the slogan Induction by
Compression. For MDL, this can be roughly described as follows:

Given a set of models, or hypotheses, $\mathcal{H}$, the best model $H \in \mathcal{H}$ is the one that minimizes

L(H) + L(D | H),

in which L(H) is the length in bits of the description of H, and L(D | H) is the length in bits of
the data when encoded with model H. That is, the MDL-optimal model for a database D encodes D most
succinctly among all possible models; it provides the best possible lossless compression.

MDL gives us a systematic approach for selecting the model that best balances the complexity
of the model and its fit to the data. While models that are overly complex may provide an
arbitrarily good fit to the data and thus have low L(D | H), they overfit the data, and are penalized
with a relatively high L(H). Overly simple models, on the other hand, have very low L(H), but
as they fail to identify important structure in D, their corresponding L(D | H) tends to be
relatively high. As such, the MDL-optimal model provides the best balance between model
complexity and goodness of fit.
The above definition is called two-part MDL, or crude MDL, as opposed to refined MDL, where
model and data are encoded together [Grünwald, 2007]. We use two-part MDL because we are
specifically interested in the model. Further, although refined MDL has stronger theoretical
foundations, it cannot be computed except for some special cases. Note that MDL requires the
compression to be lossless in order to allow for fair comparison between different $H \in \mathcal{H}$.

To use MDL, we have to define what our models $\mathcal{H}$ are, how a model $H \in \mathcal{H}$ describes the data at
hand, and how we encode this all in bits. For example, in the following chapters we will use
block-encoding or code-table encoding as our models. Note that in MDL we are only concerned
with code lengths, not actual code words.
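
As a toy illustration of two-part MDL (not one of the encodings used in later chapters), the sketch below selects a Bernoulli model for a binary sequence. The model class is an assumption of ours: a model is a parameter quantized to a given number of bits, we charge L(H) as that number of bits, and L(D | H) is the Shannon code length of the data under the model. Only code lengths are computed, never actual code words.

```python
import math


def data_cost_bits(ones, zeros, theta):
    """L(D | H): Shannon code length of the data under Bernoulli(theta)."""
    return -(ones * math.log2(theta) + zeros * math.log2(1.0 - theta))


def best_two_part_model(data, max_precision=8):
    """Pick the quantized Bernoulli parameter minimizing L(H) + L(D | H).

    A model H is a pair (precision, index) standing for
    theta = (index + 0.5) / 2**precision; we charge L(H) = precision bits.
    Returns (total_bits, precision, theta).
    """
    ones = sum(data)
    zeros = len(data) - ones
    best = None
    for precision in range(max_precision + 1):
        for index in range(2 ** precision):
            theta = (index + 0.5) / 2 ** precision
            total = precision + data_cost_bits(ones, zeros, theta)
            if best is None or total < best[0]:
                best = (total, precision, theta)
    return best


if __name__ == "__main__":
    skewed = [1] * 90 + [0] * 10
    print(best_two_part_model(skewed))
```

On a balanced sequence the zero-bit model (theta = 0.5) is selected, since extra precision costs bits without improving the fit. On a skewed sequence a few bits of precision pay for themselves, while very fine precision is rejected because its model cost outweighs the small gain in L(D | H).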


8.3 Related Work 


Related work for this part includes outlier detection in clouds of multi-dimensional data points,
anomaly detection in graph data, event detection in time-series data, data summarization, graph
clustering, and connection subgraphs, which we discuss next in the given order.


Outlier Detection in Multi-dimensional Data 

Outlier detection has attracted wide interest, being a difficult problem despite its apparent
simplicity. Even the definition of an outlier is hard to give. For instance, [Hawkins, 1980] defines an
outlier as “an observation that deviates so much from other observations as to arouse suspicion
that it was generated by a different mechanism.” Similar, but not identical, definitions have been
given by others: [Barnett and Lewis, 1994] state that an outlier is one that appears to deviate
markedly from the other members of the sample in which it occurs, and [Johnson and Wichern,
1998] define an outlier to be an observation in a data set that appears to be inconsistent with the
remainder of that set of data. These definitions are formalized by [Knorr and Ng, 1998]:

Definition 6 ($DB(pct, d_{min})$-outlier) An object $p$ in a data set $D$ is a $DB(pct, d_{min})$-outlier if at least
percentage $pct$ of the objects in $D$ lie at a distance greater than $d_{min}$ from $p$. That is, the cardinality
of the set $\{ q \in D \mid d(p, q) > d_{min} \}$ is at least $pct\%$ of the size of $D$, where $d(p, q)$ denotes
the distance between object $q$ and object $p$ in $D$.
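
The definition translates directly into a naive quadratic scan, sketched below under the assumption that objects are numeric tuples compared with Euclidean distance. The function name is ours, and the efficient algorithms of [Knorr and Ng, 1998] are not reproduced here.

```python
import math


def db_outliers(points, pct, d_min):
    """Return the points that are DB(pct, d_min)-outliers (naive O(n^2) scan).

    A point p is flagged when at least pct percent of all objects in the data
    set lie farther than d_min from p (Euclidean distance).
    """
    n = len(points)
    outliers = []
    for p in points:
        far = sum(1 for q in points if math.dist(p, q) > d_min)
        if far * 100 >= pct * n:
            outliers.append(p)
    return outliers


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
    print(db_outliers(data, pct=80, d_min=3))
```

In the toy data, four of the five objects lie farther than 3 from (10, 10), which meets the 80% requirement, while each of the clustered points has only one object that far away.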

Outlier detection methods form two classes, parametric and non-parametric. The former class
includes statistical methods that assume there exists a standard underlying distribution that
fits the observations [Barnett and Lewis, 1994; Hawkins, 1980]; those observations
that deviate from the model assumptions are flagged as outliers. These methods exhibit
problems for high dimensional data sets and for settings where the underlying data distribution
is unknown [Papadimitriou et al., 2003]. The latter class includes distance-based and
density-based data mining methods. These methods typically define as an outlier the (n-D) point that is
too far away from the rest, and thus lives in a low-density area [Knorr and Ng, 1998]. Typical
methods include LOF [Breunig et al., 2000b], LOCI [Papadimitriou et al., 2003], and recently
LSOD [Wang et al., 2011]. These methods not only flag a point as an outlier but they also give
outlierness scores; thus, they can sort the points according to their “strangeness”.

Many other density-based methods that perform well in detecting outliers in very large datasets 
of high dimension are proposed in [Aggarwal and Yu, 2001; Arning et al., 1996; Chaudhary et al.,
2002; Guha et al., 2001; Kaufman and Rousseeuw, 1990; Ng and Han, 1994; Zhang et al., 1996]. 
Feature bagging [Lazarevic and Kumar, 2005] also proves useful for tackling high
dimensionality: features are randomly grouped into multiple sets of different sizes, an outlier
detection algorithm is run on each set, and the resulting scores are combined.
Finally, most clustering algorithms [Guha et al., 2001; Kaufman and Rousseeuw, 1990; Ng and 
Han, 1994; Zhang et al., 1996] reveal outliers as a by-product. 

While most work on outlier detection has focused on numerical datasets, there also exists some
work on unsupervised anomaly detection in categorical data. [Bronstein et al., 2001] learn
the structure and the parameters of a Bayesian network, and use the log-likelihood values as the
anomalousness score of each record. [Das and Schneider, 2007; Wong et al., 2003] address the
problem of finding anomaly patterns. They build a Bayes net that represents the baseline
distribution, and then score all one- and two-component rules with unusual proportions compared to the
baseline, with the aim of detecting rules that summarize significant patterns of anomalies. [Smets
and Vreeken, 2011] take a pattern-based compression approach and employ Krimp [Siebes
et al., 2006] as the compressor for anomaly detection.

For a comprehensive survey on outlier detection techniques, see [Chandola et al., 2009]. 


Anomaly Detection in Graph Data 

Compared to outlier detection, anomaly detection in graph-based data has only recently gained 
attention. Current work in this area can be divided into two areas, discovering anomalies in static 
snapshots of graphs and in time series of graphs. Spotting anomalies in time-evolving graphs, 
which we include in the next sub-section, largely focuses on detecting anomalous points in time 
during the evolution of a graph by using different graph editing distances. 

[Noble and Cook, 2003] detect anomalous sub-graphs using variants of the Minimum Description
Length principle. [Eberle and Holder, 2007] also use the MDL principle as well as other
probabilistic measures to detect several types of anomalies (e.g. unexpected/missing nodes/edges).
[Liu et al., 2005] detect non-crashing bugs in software using frequent execution flow graphs
combined with classification. In related work, [Eberle and Holder, 2006] use graph properties such
as connectedness and clustering coefficient to show that normal graphs and intentionally altered
graphs differ significantly in such properties. [Chakrabarti, 2004] uses MDL
to spot outlier edges that lie across clusters. [Gao et al., 2010] spot anomalous nodes that do
not belong well to the community in which they reside, which they call community outliers. [Sun et al.,
2005] use proximity and random walks to assess the normality of nodes in bipartite graphs.
OutRank [Moonesinghe and Tan, 2008] and LOADED [Ghoting et al., 2004] use similarity graphs
of objects to detect outliers. [Shetty and Adibi, 2005] use an entropy-based approach, flagging
important nodes that increase the entropy of a graph when they are removed.
Event Detection in Time-series Data


Anomaly detection has been studied widely in many settings such as ‘anomalous point
detection’ on clouds of multi-dimensional points, spatio-temporal ‘anomalous pattern detection’, and
‘graph anomaly detection’, with many motivating applications such as network intrusion
detection [Sequeira and Zaki, 2002], detection of medical insurance claim fraud, credit card fraud,
electronic auction fraud [Bolton and David, 2002; Chau et al., 2006], fault detection in
engineering systems [Fujimaki et al., 2005] as well as many others.

Another class of anomaly detection methods can be grouped under ‘change-point detection’
on time-series data, which addresses the problem of spotting points in time at
which properties of the time-series data change significantly [Basseville and Nikiforov, 1993;
Brodsky and Darkhovsky, 1993; Gustafsson, 2000; Kawahara et al., 2007; Kifer et al., 2004;
Yamanishi and ichi Takeuchi, 2002]. This problem is also referred to as event detection [Guralnik
and Srivastava, 1999].

Although the change-point detection problem has been actively studied in the statistics and the
data mining communities over the last several decades, there has been much less focus on
change-point detection in graph data in particular. [Ide and Kashima, 2004] developed an eigenvector-based
algorithm to detect faults in multi-tier Web-based systems represented as a time sequence
of graphs. Another line of research [Bunke and Shearer, 1998; Shoubridge et al., 2002] derives
distance functions between a pair of graphs, computes distances between consecutive graphs in
a given sequence, and finally applies traditional anomaly detection methods to the time series of
distance values. [Sun et al., 2007] propose a parameter-free algorithm to discover communities
in streams of graphs and flag points in time as discontinuity points when the community structure
changes significantly.
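
The distance-based line of work above can be sketched as follows. This is a simplified illustration under our own assumptions, not the method of any cited paper: the graph distance is the number of edges present in exactly one of two consecutive snapshots, and a snapshot is flagged when its distance to its predecessor exceeds the mean of earlier distances by a multiple of their standard deviation. The function names, the threshold, and the floor on the standard deviation are hypothetical choices.

```python
import statistics


def edge_edit_distance(edges_a, edges_b):
    """Number of edges present in exactly one of the two graphs."""
    return len(set(edges_a) ^ set(edges_b))


def flag_change_points(snapshots, threshold=3.0, min_std=1.0):
    """Return indices of snapshots that differ unusually from their predecessor.

    snapshots is a list of edge collections, with each undirected edge stored
    in one canonical form such as (small_id, large_id). Each distance is
    compared with the mean and standard deviation of all earlier distances.
    """
    distances = [edge_edit_distance(a, b) for a, b in zip(snapshots, snapshots[1:])]
    flagged = []
    for t in range(2, len(distances)):
        history = distances[:t]
        mean = statistics.fmean(history)
        std = max(statistics.pstdev(history), min_std)  # floor avoids a zero spread
        if distances[t] > mean + threshold * std:
            flagged.append(t + 1)  # distances[t] compares snapshots t and t + 1
    return flagged
```

Any graph distance could be substituted for the edge-set difference, and any time-series detector for the threshold rule; the pipeline of reducing a graph sequence to a series of distances stays the same.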


Data Description and Summarization 

Our CompreX method describes data using sets of patterns, so research on pattern set mining 
is related. The Krimp algorithm by [Siebes et al., 2006; Vreeken et al., 2011] employs the MDL
principle [Grünwald, 2007] to define the best set of patterns as those patterns that compress the
data best. Krimp heuristically approximates the optimal pattern set by considering a collection 
of itemsets in a fixed order, greedily selecting itemsets that contribute to better compression. The 
code tables it discovers have been shown to characterize the data distribution very well, providing 
highly competitive performance on a variety of data mining tasks [Smets and Vreeken, 2011; 
Vreeken et al., 2011]. 

The Pack algorithm [Tatti and Vreeken, 2008] also follows a bottom-up MDL approach,
using a decision tree per attribute to encode binary data 0/1 symmetrically. By translating these
trees into downward-closed families of itemsets, patterns can be extracted. Although good
compression results are obtained, the downward-closed requirement means that large groups of
itemsets are typically returned. Further examples of pattern set mining methods that describe data include
Tiling [Geerts et al., 2004], Noisy Tiling [Kontonasios and De Bie, 2010], and Boolean
matrix factorization [Miettinen et al., 2008]. As these do not define a probability (or code length)
distribution for individual rows, it is not trivial to apply them for anomaly detection. Attribute
clustering [Mampaey and Vreeken, 2010] provides a high-level summarization of the data, where
the authors employ MDL to find the optimal grouping of binary attributes. These groups are
described using crude code tables, consisting of all attribute-value combinations in the data.


Graph Clustering 

Graph partitioning has been well studied in the literature. The top-performing methods include 
METIS [Karypis and Kumar, 1998] and spectral clustering [Ng et al., 2001]. However, these 
as well as many other graph partitioning methods [Andersen et al., 2006; Flake et al., 2000;
Girvan and Newman, 2002] work only with the connectivity structure of the graph and cannot be
applied directly to attributed graphs. More importantly, they require the number of partitions and a
measure of imbalance between any two partitions as input. These are usually hard for the user
to specify, especially for large graphs, and require experimentation to get good results. For
example, spectral partitioning requires a choice among several measures such as ratio cut [Chan
et al., 1993], normalized cut [Shi and Malik, 2000], or min-max cut [Ding et al., 2001].

The idea of using (lossy) compression for graph clustering is introduced in [Dhillon et al., 2003]. 
The information-theoretic co-clustering algorithm simultaneously clusters rows and the columns 
of a normalized contingency table which is treated as a two-dimensional joint probability
distribution. The algorithm, however, requires the number of row and column clusters as input.

In terms of parameter-free graph clustering algorithms, Autopart [Chakrabarti, 2004] and
cross-associations [Chakrabarti et al., 2004a], and their extensions to time-evolving [Sun et al., 2007]
and k-partite graphs [He et al., 2009], are the most representative. These methods use the
minimum description length (MDL) principle [Grünwald, 2007] to automatically choose the number
of clusters. However, they do not apply to attributed graphs as they operate on the adjacency
matrix of the graph, and consider only the connectivity structure of the graph.

Compared to the wide range of work on graph clustering, there has been much less focus on
clustering attributed graphs. [Hanisch et al., 2002] transform the graph and the attributes to a
combined distance metric and then apply flat clustering. This and other similar methods [Tian
et al., 2008] achieve homogeneity of attributes for the nodes in the same cluster; however, they
tend to yield low intra-cluster connectivity. [Zhou et al., 2009, 2010] transform the attributes to
additional nodes in the graph, where original nodes are linked to the attribute nodes that they exhibit.
Again, all these methods require the number of clusters to be explicitly specified.

Recently, [Günnemann et al., 2010; Moser et al., 2009] propose methods to extract cohesive
subgraphs from an attributed graph rather than partitioning the entire graph. The subgraphs 
exhibit high density and homogeneity in a subset of their attributes. In these methods other types 
of parameters need to be set by the user; these include the subspace dimensionality and density 
thresholds, as well as the minimum number of nodes in each cluster. 

The spectral relational clustering algorithm [Long et al., 2006] performs collective factorization
of related matrices for multi-type relational data clustering. Although the method is applicable to
graph data with attributes, the user intervention there is two-fold: besides the number of clusters
for each type of node, reasonable weights for the different types of relations or attributes also need to
be specified. In addition, it is not easy to do spectral clustering on directed graphs with attributes
(spectral clustering exists for plain directed graphs, although it is much more complicated
than that for undirected graphs).


Connection Subgraphs and Visualization 

Our Dot2Dot method finds connection subgraphs among top anomalous nodes for better
characterization and visualization. The connection subgraph mining problem is introduced by
[Faloutsos et al., 2004]. As they formulated it, a connection subgraph is a relatively small, connected
subgraph that captures well the relationship (set of interesting paths) between two nodes in a given
graph. Later, [Ramakrishnan et al., 2005] and [Sevon and Eronen, 2008] extended [Faloutsos
et al., 2004] for querying labeled graphs in which nodes and edges belong to certain types or
relations. Along similar lines, [Jin et al., 2007] and [Shahaf and Guestrin, 2010] propose techniques to
discover a coherent chain of evidence trails that connect two given concepts in text documents.
All the methods mentioned so far, however, are appropriate for only two query nodes; thus,
they do not generalize to more than two input nodes.

The Center-Piece Subgraph Problem introduced by [Tong and Faloutsos, 2006; Tong et al., 2007] 
finds the most ‘centered’ node that has strong connections to most or all of the query nodes, and 
keeps adding important paths from the center-piece node to active nodes until an upper bound on 
the number of nodes in the output graph is reached. [Koren et al., 2006] introduce the Cycle-Free 
Effective Conductance measure for graph proximity and propose methods to extract the subgraph 
with at most a certain size which retains most of the proximity between query nodes. Proximity 
and connection subgraphs are also exploited for graph visualization and summarization [Chau 
et al., 2008, 2011; Jr. et al., 2006]. For a survey, see [Alsudairy et al., 2011]. 

Another set of related work focuses on expanding communities around a given set of seed 
nodes [Andersen and Fang, 2006; Andersen et al., 2006; Riedy et al., 2011; Spielman and Teng, 
2008]. These methods find a subgraph containing the seed nodes and differ in the specific graph
measure that they optimize, such as modularity, conductance, or clustering coefficient.

We find it worth emphasizing that graph clustering [Flake et al., 2000; Girvan and Newman,
2002; Karypis and Kumar, 1998], although related, targets a different problem. While its goal
is to cluster the whole graph, we try to partition and connect a small subset of nodes, making use
of the graph structure.

[Bhalotia et al., 2002] develop methods to find effective connection trees among keywords for
browsing and querying. While conceptually similar, our problem definition and formulation
have important differences, such as incorporating encoding length. [Leskovec et al., 2007a] use
features extracted from query connection subgraphs together with machine learning techniques
to predict quality of query results. The connection subgraph is composed of the induced subgraph
returned by the query, where its disconnected components are simply joined by shortest paths.
[Guan et al., 2011] study a new measure called structural correlation to assess how strongly a set
of query nodes (e.g. nodes affected by an event) are correlated via the graph structure. Their 
method provides a correlation score but not a connection subgraph. [Tong et al., 2010] finds the 
top-k nodes with the highest ‘gateway-ness’ score with respect to a given source and target node 
set, such that they collectively lie on most of the short paths from source nodes to target nodes. 
Their goal, rather than to find strong connections between query nodes, is to find a set of nodes 
that disconnect source and target nodes to the largest extent when removed from the graph. 
Chapter 9

Anomalies in Plain Graphs 


PROBLEM STATEMENT: Given a large, weighted graph, how can we find anomalies? Which
rules should be violated, before we label a node as an anomaly?


Most anomaly detection algorithms focus on clouds of multi-dimensional points, as we described
in the survey section. Our goal, on the other hand, is to spot strange nodes in a graph with
weighted edges. What patterns and laws do such graphs obey? What features should we extract
from each node? In this chapter, we answer all these questions and develop our OddBall
method. The main contributions of this work are:

1. Feature extraction: We propose to focus on neighborhoods, that is, a sphere, or a ball
(hence the name OddBall) around each node (the ego): that is, for each node, we consider
the induced subgraph of its neighboring nodes, which is referred to as the egonet. Out of
the huge number of numerical features one could extract from the egonet of a given node, 
we give a carefully chosen list, with features that are both fast to compute, as well as 
effective in revealing outliers. 

2. Egonet patterns: We show that egonets obey some surprising patterns (like the Egonet
Density Power Law (EDPL), EWPL, ELWPL, and ERPL), which gives us confidence to
declare as outliers the ones that deviate. We support our observations by showing that the
ERPL yields the EWPL.

3. Scalable algorithm: Based on those patterns, we propose OddBall, a scalable,
unsupervised method for anomalous node detection.

4. Application on real data: We apply OddBall to numerous real graphs (DBLP, political
donations, and other domains) and we show that it indeed spots nodes that a human would
agree are strange and/or extreme.

Jumping ahead, the major types of anomalous nodes we can spot are as follows (see Fig. 9.1 for
examples).

1. Near-cliques and stars: Those nodes whose neighbors are very well connected
(near-cliques) or not connected (stars) turn out to be “strange”: in most social networks,
friends of friends are often friends, but either extreme (clique/star) is suspicious.

2. Heavy vicinities: If person i has contacted n distinct people in a who-calls-whom network,
we would expect that the number of phone calls (weight) would be proportional to n (say,
3×n, or 5×n). Extreme total weight would be suspicious, indicating, e.g., faulty equipment
that forces redialing.

3. Dominant heavy links: In the who-calls-whom scenario above, a very heavy single link in
the 1-step neighborhood of person i is also suspicious, indicating, e.g., a stalker that keeps
on calling only one of his/her contacts an excessive number of times.




Figure 9.1: Types of anomalies that OddBall detects: (a) near-star, (b) near-clique, (c) heavy
vicinity, (d) dominant edge. Top row: toy sketches of egonets (ego shown in larger, red circle).
Bottom row: actual anomalies spotted in real datasets. (a) A near-star in PostNet: Instapundit, on
Hurricane Katrina relief agencies (instapundit.com/archives/025235.php): an extremely long post,
with many updates, and numerous links to diverse other posts about donations. (b) A near-clique
in PostNet: sizemore.co.uk, who often linked to its own posts, as well as to its own posts in other
blogs. (c) A heavy vicinity in PostNet: blog.searchenginewatch.com/blog has abnormally high
weight with respect to the number of edges in its egonet. (d) Dominant edge(s) in CampOrg: in
FEC 2004, George W. Bush received a huge donation from a single donor committee, the Democratic
National Committee (~$87M) (!); in fact, this amount was spent against him. The next heaviest
link (~$25M) is from the Republican National Committee.


Next we give the primary observations and the description of OddBall, show experimental 
results, and list summary of contributions. 
9.1 OddBall: Method Description


The basic research questions we start with are: (a) what features should we use to characterize 
a neighborhood? and (b) what does a ‘normal’ neighborhood look like? Both questions are 
open-ended, but we give some answers below. 

First, we need to choose the value of k steps to study neighborhoods (see Definition 5). Given
that real-world graphs have small diameter [Albert et al., 1999], we need to stay with small values
of k, and specifically, we recommend k = 1. We report our findings only for k = 1, because
using k > 1 does not provide any more intuitive or revealing information, while it has heavy
computational overhead, possibly intractable for very large graphs.


9.1.1 Feature Extraction 

The first of our two intertwined questions is which statistics/features to extract from each node
neighborhood. There is an infinite set of functions/features that we could use to characterize a
neighborhood (number of nodes, one or more eigenvalues, number of triangles, effective radius
of the central node, number of neighbors of degree 1, etc.). Which of these should we use?

Intuitively, we want to select features that (a) are fast to compute and (b) will lead us to
patterns/laws that most nodes obey, except for a few anomalous nodes. We spent a lot of time
experimenting with about a dozen features, trying to see whether the nodes of real graphs obey
any patterns with respect to those features. The majority of features lead to no obvious patterns,
and thus we do not present them.

The trimmed-down set of features that are successful in spotting patterns is the following:

1. $N_i$: number of neighbors (degree) of ego i,

2. $E_i$: number of edges in egonet i,

3. $W_i$: total weight of egonet i,

4. $\lambda_{w,i}$: principal eigenvalue of the weighted adjacency matrix of egonet i,

5. $S_i$: number of singleton neighbors (neighbors of degree 1) of ego i.
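
The five features can be sketched in a few lines of Python. This is only an illustration, not the OddBall implementation: it assumes an undirected graph without self-loops stored as a dictionary `{node: {neighbor: weight}}`, and the function names `egonet_features` and `principal_eigenvalue` are hypothetical. The eigenvalue is approximated with a shifted power iteration, which is one simple choice among many.

```python
from itertools import combinations


def egonet_features(adj, ego):
    """Return (N, E, W, S) for the 1-step egonet of `ego`.

    `adj` is an undirected weighted graph stored as {node: {neighbor: weight}}
    (an assumed representation, chosen for brevity).
    """
    neighbors = set(adj[ego])
    n_neighbors = len(neighbors)              # N_i: degree of the ego
    n_edges = n_neighbors                     # ego-to-neighbor edges are always in the egonet
    total_weight = sum(adj[ego].values())
    # Edges between two neighbors also belong to the induced subgraph.
    for u, v in combinations(neighbors, 2):
        if v in adj[u]:
            n_edges += 1
            total_weight += adj[u][v]
    # S_i: neighbors whose only connection in the whole graph is the ego.
    n_singletons = sum(1 for u in neighbors if len(adj[u]) == 1)
    return n_neighbors, n_edges, total_weight, n_singletons


def principal_eigenvalue(adj, ego, iters=500):
    """Approximate the largest eigenvalue of the egonet's weighted adjacency matrix."""
    nodes = [ego] + list(adj[ego])
    members = set(nodes)
    # Restrict every adjacency row to the egonet (the induced subgraph).
    rows = {u: {v: w for v, w in adj[u].items() if v in members} for u in nodes}
    # Shifting by the largest weighted degree makes the top eigenvalue dominant
    # in magnitude, so the iteration does not oscillate on star-like egonets.
    shift = max(sum(row.values()) for row in rows.values())
    x = {u: 1.0 for u in nodes}
    value = 0.0
    for _ in range(iters):
        y = {u: shift * x[u] + sum(w * x[v] for v, w in rows[u].items()) for u in nodes}
        norm = sum(val * val for val in y.values()) ** 0.5
        if norm == 0.0:                       # isolated ego: the matrix is all zeros
            return 0.0
        x = {u: val / norm for u, val in y.items()}
        # Rayleigh quotient x^T A x of the unshifted matrix.
        value = sum(x[u] * w * x[v] for u in nodes for v, w in rows[u].items())
    return value
```

Calling `egonet_features(adj, i)` and `principal_eigenvalue(adj, i)` for every node i yields one feature point per node. The pairwise check over neighbors makes the cost quadratic in the degree of the ego.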

The next question is how to look for outliers in such an n-dimensional feature space, with one
point for each node of the graph. In our case, n = 5 (the five features listed above), but one
might have more features depending on the application and types of anomalies one wants to detect.
A quick answer to this would be to use traditional outlier detection methods for clouds of points
using all the features.

In our setting, we can do better. As we show next, we group features into carefully chosen 
pairs, where we show that there are patterns of normal behavior (typically, power-laws). We 
flag those points that significantly deviate from the discovered patterns as anomalous. Among 
the numerous pairs of features we studied, the successful pairs and the corresponding type of 
anomaly are the following: 

• E vs N: CliqueStar: detects near-cliques and stars

• W vs E: HeavyVicinity: detects many recurrences of interactions

• $\lambda_w$ vs W: DominantPair: detects single dominating heavy edge (strongly connected pair)

• S vs N: LoneStar: detects almost-isolated stars


9.1.2 Laws and Observations 

The second of our research questions is what normal neighborhoods look like. Thus, it is
important to find patterns (“laws”) for neighborhoods of real graphs, and then report the deviations,
if any. In this work, we report some new, surprising patterns:

For a given graph $\mathcal{G}$, node $i \in V(\mathcal{G})$, and the egonet $\mathcal{G}_i$ of node $i$:

Observation 9.1.1 (EDPL: Egonet Density Power Law) the number of nodes $N_i$ and the number
of edges $E_i$ of $\mathcal{G}_i$ follow a power law.

$$E_i \propto N_i^{\alpha}, \quad 1 \le \alpha \le 2.$$

In our experiments the EDPL exponent $\alpha$ ranged from 1.10 to 1.66. Fig. 9.3 illustrates this
observation, for several of our datasets. Plots show $E_i$ versus $N_i$ for every node (green points);
the black circles are the median values for each bucket of points (separated by vertical dotted
lines) after applying logarithmic binning on the x-axis as in [Mcglohon et al., 2008]; the red line
is the least squares (LS) fit on the median points. The plots also show a blue line of slope 2, which
corresponds to cliques, and a black line of slope 1, which corresponds to stars. All the plots are in
log-log scales.
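
The fitting procedure described above (logarithmic binning on the x-axis, one median point per bucket, least squares line in log-log space) might be sketched as follows. The base-2 bucket boundaries and the function name `fit_power_law` are assumptions made for this illustration; the text does not specify the bin width that was used.

```python
import math
from statistics import median


def fit_power_law(xs, ys):
    """Fit y = C * x^exponent on per-bucket medians in log-log space.

    Buckets are powers of two on the x-axis (an assumed bin width).
    Returns (C, exponent).
    """
    buckets = {}
    for x, y in zip(xs, ys):
        if x > 0 and y > 0:                   # the logarithm needs positive values
            buckets.setdefault(math.floor(math.log2(x)), []).append((x, y))
    # One representative point per bucket: the median x and the median y.
    points = [
        (math.log(median(p[0] for p in pts)), math.log(median(p[1] for p in pts)))
        for pts in buckets.values()
    ]
    if len(points) < 2:
        raise ValueError("need at least two non-empty buckets to fit a line")
    mean_x = sum(px for px, _ in points) / len(points)
    mean_y = sum(py for _, py in points) / len(points)
    sxx = sum((px - mean_x) ** 2 for px, _ in points)
    sxy = sum((px - mean_x) * (py - mean_y) for px, py in points)
    exponent = sxy / sxx                      # slope of the least squares line
    intercept = mean_y - exponent * mean_x
    return math.exp(intercept), exponent
```

Fitting on the medians rather than on all points keeps the few very high-degree nodes from dominating the fit.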

Observation 9.1.2 (EWPL: Egonet Weight Power Law) the total weight $W_i$ and the number
of edges $E_i$ of $\mathcal{G}_i$ follow a power law.

$$W_i \propto E_i^{\beta}, \quad \beta \ge 1.$$

Fig. 9.5 shows the EWPL for only a subset of our datasets (due to space limits). In our
experiments the EWPL exponent $\beta$ ranged up to 1.29. Values of $\beta > 1$ indicate super-linear growth in
the total weight with respect to increasing total edge count in the egonet.

Observation 9.1.3 (ELWPL: Egonet $\lambda_w$ Power Law) the principal eigenvalue $\lambda_{w,i}$ of the weighted
adjacency matrix and the total weight $W_i$ of $\mathcal{G}_i$ follow a power law.

$$\lambda_{w,i} \propto W_i^{\gamma}, \quad 0.5 \le \gamma \le 1.$$

Fig. 9.6 shows the ELWPL for a subset of our datasets. In our experiments the ELWPL exponent
$\gamma$ ranged from 0.53 to 0.98. $\gamma = 0.5$ indicates uniform weight distribution whereas $\gamma \approx 1$ indicates
a dominant heavy edge in the egonet, in which case the weighted eigenvalue closely follows the
maximum edge weight. $\gamma = 1$ if the egonet contains only one edge.
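
The two extreme exponents can be checked by hand on idealized egonets. The two cases below are illustrative assumptions (a star whose $n$ edges all carry the same weight $w$, and an egonet made of a single edge), not cases taken from the datasets:

```latex
% Star egonet: ego plus n neighbors, every edge with the same weight w.
W_i = n w, \qquad
\lambda_{w,i} = w \sqrt{n} = \sqrt{w} \, W_i^{1/2}
\quad \Rightarrow \quad \gamma = 0.5 \ \text{(for fixed } w \text{)}

% Egonet consisting of a single edge of weight w.
W_i = w, \qquad
\lambda_{w,i} = w = W_i^{1}
\quad \Rightarrow \quad \gamma = 1
```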

Observation 9.1.4 (ERPL: Egonet Rank Power Law) the rank $R_{i,j}$ and the weight $W_{i,j}$ of edge
$j$ in $\mathcal{G}_i$ follow a power law.

$$W_{i,j} \propto R_{i,j}^{\theta}, \quad \theta \le 0.$$

Figure 9.2: Weight $W_{i,j}$ vs. Rank $R_{i,j}$ for each edge $j$ in the egonet of node $i$. Top three nodes
with the highest edge count in their egonet from BlogNet are shown. LS line is fit in
log-log scales pointing out a power-law relation (ERPL).


Here, $R_{i,j}$ is the rank of edge $j$ in the sorted list of edge weights. ERPL suggests that edge
weights in the egonet have a skewed distribution. This is intuitive; for example in a friendship
network, a person could have many not-so-close friends (light links), but only a few close friends
(heavy links). In Fig. 9.2 we show the ERPL for the top three nodes with the highest number of
edges in their egonet from BlogNet; other datasets have similar results.

Next we show that if the ERPL holds, then the EWPL also holds. Given an egonet $\mathcal{G}_i$, the total
weight $W_i$ and the number of edges $E_i$ of $\mathcal{G}_i$, let $\mathcal{W}_i$ denote the ordered set of weights of the
edges, $W_{i,j}$ denote the weight of edge $j$, and $R_{i,j}$ denote the rank of weight $W_{i,j}$ in set $\mathcal{W}_i$. Then,


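Lemma 5 ERPL implies EWPL, that is: If $W_{i,j} \propto R_{i,j}^{\theta}$, $\theta \le 0$, then

$$W_i \propto E_i^{\beta}, \qquad \begin{cases} \beta = 1, & \text{if } -1 \le \theta \le 0 \\ \beta > 1, & \text{if } \theta < -1 \end{cases}$$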


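Proof

For brevity, we give the proof for $\theta < -1$; other cases are similar. If $W_{i,j} = c R_{i,j}^{\theta}$, then
$W_{min} = c E_i^{\theta}$: the least heavy edge $l$ with weight $W_{min}$ is ranked the last, i.e. $R_{i,l} = E_i$. Thus
we can write $W_i$ as

$$
\begin{aligned}
W_i &= W_{min} E_i^{-\theta} \left( \sum_{j=1}^{E_i} j^{\theta} \right)
     \approx W_{min} E_i^{-\theta} \left( \int_{j=1}^{E_i} j^{\theta} \, dj \right) \\
    &= W_{min} E_i^{-\theta} \left( \left. \frac{j^{\theta+1}}{\theta+1} \right|_{j=1}^{E_i} \right)
     = W_{min} E_i^{-\theta} \left( \frac{1}{-\theta-1} - \frac{1}{(-\theta-1) E_i^{-\theta-1}} \right)
\end{aligned}
$$

For sufficiently large $E_i$ and given $\theta < -1$, the second term in parentheses goes to 0. Therefore,
$W_i \approx c' E_i^{-\theta}$, with $c' = \frac{W_{min}}{-\theta-1}$. Since $\theta < -1$, $\beta = -\theta > 1$. ∎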
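
The lemma can also be checked numerically. The sketch below evaluates the sum used in the proof for two egonet sizes and reports the slope of $\log W_i$ versus $\log E_i$. It assumes, as the proof does, that $W_{min}$ is held fixed while $E_i$ grows; the helper names and the two sizes are arbitrary choices for this illustration.

```python
import math


def total_weight(num_edges, theta, w_min=1.0):
    """W_i = W_min * E^(-theta) * sum_{j=1..E} j^theta, the sum used in the proof."""
    rank_sum = sum(j ** theta for j in range(1, num_edges + 1))
    return w_min * num_edges ** (-theta) * rank_sum


def local_exponent(theta, e_small=1_000, e_large=100_000):
    """Slope of log W_i versus log E_i between two egonet sizes."""
    w_small = total_weight(e_small, theta)
    w_large = total_weight(e_large, theta)
    return math.log(w_large / w_small) / math.log(e_large / e_small)
```

Under these assumptions, `local_exponent(-2.0)` comes out very close to 2 (the case $\beta = -\theta > 1$), while `local_exponent(-0.5)` comes out close to 1 (the case $\beta = 1$).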
9.1.3 Anomaly Detection


We can easily use the observations given in Section 9.1.2 in anomaly detection since anomalous nodes
would deviate from the normal pattern. Let us define the y-value of a node $i$ as $y_i$ and
similarly, let $x_i$ denote the x-value of node $i$ for a particular feature pair $f(x, y)$. Given the
power law equation $y = C x^{\theta}$ for $f(x, y)$, we define the outlierness score of node $i$ to be

$$\text{out-line}(i) = \frac{\max(y_i, C x_i^{\theta})}{\min(y_i, C x_i^{\theta})} \cdot \log\left( |y_i - C x_i^{\theta}| + 1 \right)$$

Intuitively, the above measure is the “distance to fitting line”. Here we penalize each node with
both the number of times that $y_i$ deviates from its expected value $C x_i^{\theta}$ given $x_i$, and with the
logarithm of the amount of deviation. This way, the minimum outlierness score becomes 0, for
which the actual value $y_i$ is equal to the expected value $C x_i^{\theta}$.
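
As a minimal sketch, the score can be computed directly from a fitted constant and exponent. The function name `out_line` and the use of the natural logarithm are assumptions of this illustration (the base of the logarithm is not stated in the text); the inputs are assumed to be strictly positive.

```python
import math


def out_line(x, y, c, theta):
    """Distance-to-fitting-line score of a point (x, y) for the law y = c * x^theta.

    Assumes x, y and c are positive so that the ratio is well defined.
    """
    expected = c * x ** theta
    ratio = max(y, expected) / min(y, expected)   # how many times y deviates
    return ratio * math.log(abs(y - expected) + 1.0)
```

For example, with `c = 1` and `theta = 2`, a node with `x = 4` is expected to have `y = 16`; observing `y = 16` gives a score of 0, while observing `y = 32` gives a ratio of 2 multiplied by `log(17)`.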

This simple and easy-to-compute method not only helps in detecting outliers, but also provides 
a way to sort the nodes according to their outlierness scores. However, this method is prone to 
miss some outliers and therefore could yield false negatives for the following reason: Assume 
that there exist some points that are far away from the remaining points but that are still located 
close to the fitting line. In our experiments with real data, we observe that this usually happens 
for high values of x and y. For example, in Fig. 9.3(a), the points marked with left-triangles (<) 
are almost on the fitting line even though they are far away from the rest of the points. 

We want to flag both types of points as outliers, and thus we propose to combine our heuristic 
with a density-based outlier detection technique. We used LOF [Breunig et al., 2000b], which 
also assigns outlierness scores out-lof(i) to data points; but any other outlier detection method 
would do, as long as it gives such a score. To obtain the final outlierness score of a data point 
i, one might use several methods such as taking a linear function of both scores and ranking the 
nodes according to the new score, or merging the two ranked lists of nodes, each sorted on a 
different score. In our work, we simply used the sum of the two normalized (by dividing by the
maximum) scores, that is, out-score(i) = out-line(i) + out-lof(i).
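
The combination step might look like the following sketch. It assumes the two score lists are already available and aligned by node (the density-based scores could come from any LOF implementation), and the function names are hypothetical.

```python
def combine_scores(out_line_scores, out_lof_scores):
    """Sum of the two scores after dividing each list by its maximum."""
    max_line = max(out_line_scores) or 1.0    # avoid dividing by zero if all scores are 0
    max_lof = max(out_lof_scores) or 1.0
    return [a / max_line + b / max_lof
            for a, b in zip(out_line_scores, out_lof_scores)]


def rank_nodes(scores):
    """Node indices sorted from most to least anomalous."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

Dividing by the maximum puts both scores on a comparable scale before they are added, so neither one dominates the final ranking purely because of its units.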


9.2 Experimental Results 

9.2.1 Structural Anomalies 

CliqueStar and the LoneStar are used for finding structural anomalies. Here, we are interested 
in the communities that neighbors of a node form. In particular, CliqueStar detects anomalies 
having to do with near-cliques and near-stars and LoneStar catches those nodes whose egonet is 
weakly connected to the rest of the graph: most of the neighbors being of degree 1, the egonet is 
not only star-like, but also is almost isolated from the rest of the network. 
CliqueStar


Here, we are interested in the communities that neighbors of a node form. In particular,
CliqueStar detects anomalies having to do with near-cliques and near-stars. Using CliqueStar, we were
successful in detecting many anomalies over the unipartite datasets (although it is irrelevant for
bipartite graphs since by nature the egonet forms a “star”).

In social media data PostNet and BlogNet, we detected posts or blogs that had either all their
neighbors connected (cliques) or mostly disconnected (stars). We show some illustrative
examples along with descriptions from PostNet in Fig. 9.1. See Fig.9.3a for the detected outliers on
the scatter-plot from the same dataset. In BlogNet (Fig.9.3b), the method detected several “link
blogs”, blogs devoted to posting links to a wide array of sources that do not always have
similar content. For instance, mfisn.com links to tech blogs, politics stories, and flash cartoons;
dev.upian.com/hotlinks also links to a wide range of other posts each day.

In Enron (Fig.9.3c), the node with the highest outlier score turns out to be Kenneth Lay, who was
the CEO and is best known for his role in the Enron scandal in 2001. Our method reveals that
none of his over 1K contacts ever sent emails to each other. In Oregon (Fig.9.3d), the top outliers
are the three large ISPs (Verizon, Sprint and AT&T).


LoneStar 

Some results of this method were similar to those of CliqueStar, but for a different reason: here,
we look not simply at whether the neighbors connect to each other, but whether they connect 
to the rest of the network at large. This gives important implicit information about the 2-step 
neighborhood. Thus, we are looking for “isolated stars” or “hubs”. 

In CampOrg, the most outlying Committee (Fig.9.4a) is the National Committee for an Effective 
Congress, which donated to candidates who received money from nowhere else. This makes 
intuitive sense as this committee claims to help “progressive” candidates 1 , that may have limited 
support otherwise. On the other hand, Candidates (Fig.9.4b) reveals the top two competing 
candidates, John F. Kerry and George W. Bush, who had several committees devoted exclusively 
to their campaigns. 

In CampIndiv, top outliers for Committees (Fig.9.4d) are Bush-Cheney ’04 Inc. and John Kerry
for President Inc., both of which raised money from a vast number of donors, most of whom did
not donate anywhere else. Another interesting result here is that Kerry Victory 2004 received
money from donors who donated to other committees as well, but as we will see later, Kerry
Victory 2004 donated money only to a single candidate, John F. Kerry.

In Auth-Conf (Fig.9.4c), we see that top outliers include ICRA, ISCAS and SIGUCCS with their 
authors publishing to no other conferences. This might be due to the fact that these conferences 
are the only ones related to a particular research area. Winter Simulation Conference, another 
anomaly here, happens to be a conference of a unique type. 

1 www.ncec.org 
[Figure 9.3 panels: (a) PostNet, (b) BlogNet, (c) Enron, (d) Oregon]

Figure 9.3: Illustration of the Egonet Density Power Law (EDPL), and the corresponding
anomaly CliqueStar, with outliers marked with triangles. Edge count versus node
count (log-log scale); red line is the LS fit on the median values (black circles);
dashed black and blue lines have slopes 1 and 2 respectively, corresponding to stars
and cliques. Top outlying points are enlarged. Most striking outlier: Ken Lay (CEO
of Enron), with a star-like neighborhood. See text for more discussion and Fig.9.1
for example illustration from PostNet.


In BlogNet (Fig.9.4e), the same “link blogs” as above are detected as “isolated stars”: most of 
their neighbors are not only blogs that do not link to each other, but in fact are linked to nobody 
else at all. There were similar points in PostNet , where the post has many links to otherwise 
isolated posts. For example, one outlier is KBCafe 2 , a homepage to a piece of email software 
linked several times by a single blog in different posts. See Fig.9.1 for further details. 

In PostNet (Fig.9.4f), we also found points on the other extreme: posts that were unusually
well-connected. Looking further, we found that there were several feeds that claimed that the post
permalink was the actual blog home page; this boosted the degree of these so-called “posts”,
since whenever another post linked to the homepage of this blog, we registered it as linking to
a single post. These outliers 3 had many neighbors among political blogs which pointed to other
blogs, so they were unusually non-isolated. So, in this case, anomalous behavior signaled errors
in the feeds.

2 www.kbcafe.com/rmail.aspx

Figure 9.4: Illustration of the near-isolated star anomaly LoneStar. Plots show total number
of nodes of degree 1 in the egonet vs. degree of the ego for all nodes (in log-log
scale). Most striking anomalies are marked with triangles and labeled, with detailed
explanation in text. See Fig.9.1 for an example illustration from PostNet.

In Oregon (Fig.9.4g) and Enron (Fig.9.4h), the top 3 detected points were the same as the ones
detected with CliqueStar. This is intuitive, as the neighbors in a near-star are expected to have
low degrees, indeed degree 1, making the star not only sparse but also isolated.

3 www.warandpiece.com and www.liberaloasis.com 
9.2.2 Weight Anomalies


HeavyVicinity and the DominantPair are used for finding weight anomalies. Here, we are
interested in the weights of edges in the egonets of nodes. In particular, HeavyVicinity detects
anomalies having to do with high weight neighborhoods and DominantPair catches those egonets in
which a particular edge weight dominates the others.


HeavyVicinity 

HeavyVicinity detects nodes that have considerably high total edge weight compared to the
number of edges in their egonet. In our datasets, HeavyVicinity detected “heavy egonets”, with
anomalies marked in Fig.9.5. In PostNet (Fig.9.5h), often anomalous posts were the ones that
linked to another post repeatedly 4 or were listed as the blog’s home page 5.

In BlogNet (Fig.9.5g), we detected blogs that linked to just a few other sources, either a single 
post or multiple posts from the same blog. One interesting anomaly is Automotive News Today 6,
which linked 241 times to the GM blog 7 , but never to any other blog in our dataset. One political 
blog 8 had 1816 total in- and out-links, to only 30 other blogs. On the other extreme, Bandelier 9 
had an abnormally high number of edges, but a low weight upon each. The blog had a single post 
in the dataset, so it was naturally uncommon for blogs to link to that blog multiple times. 

HeavyVicinity revealed interesting observations in bipartite graphs as well, spotting duplicates 
and irregularities. In CampOrg (Fig.9.5a,b), we see that Democratic National Committee gave 
away a lot of money compared to the number of candidates that it donated to. In addition, (John) 
Kerry Victory 2004 donated a large amount to a single candidate, whereas Liberty Congressional 
Political Action Committee donated a very small amount ($5) to a single candidate. Looking at 
the Candidates plot for the same bipartite graph, we also flagged Aaron Russo, the lone recipient 
of that PAC. (In fact, Aaron Russo is the founder of the Constitution Party which never ran any 
candidates, and Russo shut it down after 18 months.) 

In CampIndiv (Fig.9.5c,d), we see that Bush-Cheney ’04 Inc. received a lot of money from a
single donor. This is strange, as it was also listed as an anomaly in LoneStar, having many
degree 1 donors. Looking at the data, we notice that this committee is listed twice with two
different IDs.

On the other hand, we notice that the Kerry Committee received less money than would be 
expected looking at the number of checks it received in total. Further analysis shows that most 
of the edges in its egonet are of weight 0, showing that most of the donations to that committee 
have actually been returned. 

4 leados.blogs.com/blog/2005/08/overview.of _cia.html 

5 blog.searchenginewatch.com/blog 

6 www.automotive-news-today.com 

7 fastlane.gmblogs.com 

8 nocapital.blogspot.com 

9 www.bandelier.com 
Figure 9.5: Illustration of the Egonet Weight Power Law (EWPL) and the weight-edge anomaly
HeavyVicinity. Plots show total weight vs. total count of edges in the egonet for all
nodes (in log-log scales). Detected outliers include Democratic National Committee
and John F. Kerry (in FEC campaign donations), and Averill M. Law in DBLP. See
text for more discussions, and Fig.9.1 for an illustrative example from PostNet.


In Auth-Conf (Fig.9.5e,f), Averill M. Law published 40 papers to the Winter Simulation
Conference and nowhere else. This might be due to the fact that there exists no other conference that
would capture the interest of that author. In fact, under LoneStar, we saw that Winter Simulation
Conference was one of those conferences with most of its authors with degree 1, pointing to the
same possibility that it is a unique conference in a particular area.

Other interesting points are Wei Wang and Wei Li. Those authors have many papers, but they 
publish them at as many distinct conferences, probably once or twice at each conference. 
Figure 9.6: Illustration of the Egonet $\lambda_w$ Power Law (ELWPL) and dominant heavy link anomaly
DominantPair. Top anomalies are marked with triangles and labeled. See text for
details for each dataset and Fig.9.1 for an illustrative example from CampOrg.


DominantPair 


DominantPair measures whether there is a single dominant heavy edge in the egonet. In other
words, this method detects “bursty” if not exclusive edges. In PostNet (Fig.9.6h) nodes such
as ThinkProgress’s post on a leak scandal 10 and A Freethinker’s Paradise post 11 linking several
times to the ThinkProgress post were both flagged. In BlogNet (Fig. 9.6g), we detected a Drudge
blogger 12, who had 298 links, all but 4 to another blogger in the same site 13. Nocapital also
appeared here, since it had around 300 links each to two other blogs.

10 www.thinkprogress.org/leak-scandal

11 leados.blogs.com/blog/2005/08/overview.of _cia.html

In CampOrg (Fig.9.6a,b), Democratic National Committee is one of the top outliers. We would 
guess that the single large amount of donation was made to John F. Kerry. Counterintuitively, we 
see that that amount was spent for an opposing advertisement against George W. Bush. 

In CampIndiv (Fig.9.6c,d), there exist points above the slope 1 line. Further inspection shows
that these points correspond to authorities having negative weighted edges due to returns. Points
below the fitting line, such as Dean For America on the Committees plot, correspond to
authorities with even weight distributions on the edges, without a specific dominant heavy link.

DominantPair flagged extremely focused authors (those who publish heavily to one conference) in the
DBLP data, shown in Fig. 9.6(e,f). For instance, Toshio Fukuda has 115 papers in 17 conferences
(at the time of data collection), with more than half (87) of his papers in one particular conference
(ICRA). Alberto Sangiovanni-Vincentelli published 279 papers to 45 distinct conferences, with
76 of his papers in DAC. On the Conferences plot, Programming Languages and Their Definition
has 21 papers from 6 authors, with 16 papers by one particular author (Hans Bekic).


9.2.3 Scalability 

The major computational cost of our method is in feature extraction. In particular, computing
features such as the total number of edges and total weight for the egonets is the bottleneck, as
one needs to find the induced 1-step neighborhood subgraphs for all nodes in the network.

The problem of finding the number of edges in the egonet of a given node can be reduced to
the problem of triangle counting. One straightforward listing method for local triangle counting
is the Node-Iterator algorithm. Node-Iterator considers each one of the N nodes and examines
which pairs of its neighbors are connected. Time complexity of the algorithm is $O(N d_{max}^2)$.
Approximate streaming algorithms for local triangle counting can be applied to reduce the time
complexity to $O(E \log N)$ with space complexity $O(N)$ [Becchetti et al., 2008].

In our experiments, we use Eigen-Triangle, proposed in [Tsourakakis, 2008], which uses
eigenvalues/vectors to approximate the number of paths of length three, i.e. local triangle counts,
without performing actual counting. We compare its performance to Node-Iterator, which gives
exact counts. To improve speed even more, we propose pruning low degree nodes. Notice that
removing degree-1 nodes has no effect on the number of triangles in the graph. We also try
removing degree-2 nodes; this would remove some triangles but hopefully would not change the
relative number of triangles across nodes drastically and still reveal similar outliers.
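
Node-Iterator, the reduction from egonet edge counts to triangle counts, and the pruning step can be sketched together as follows. The graph is assumed to be undirected and stored as `{node: set(neighbors)}`, the function names are hypothetical, and pruning is done in a single pass over the original degrees (the text does not say whether it is applied repeatedly). Eigen-Triangle is not shown.

```python
from itertools import combinations


def node_iterator(adj):
    """Exact local triangle counts: for each node, count connected neighbor pairs."""
    return {
        node: sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
        for node, nbrs in adj.items()
    }


def egonet_edge_counts(adj):
    """E_i = N_i + (triangles through i).

    The N_i ego-to-neighbor edges are always present; every edge between two
    neighbors closes exactly one triangle with the ego.
    """
    triangles = node_iterator(adj)
    return {node: len(adj[node]) + triangles[node] for node in adj}


def prune_low_degree(adj, max_degree=1):
    """Copy of the graph without nodes of degree <= max_degree (single pass)."""
    keep = {node for node, nbrs in adj.items() if len(nbrs) > max_degree}
    return {node: adj[node] & keep for node in keep}
```

A degree-1 node cannot take part in any triangle, so `prune_low_degree(adj, 1)` leaves the triangle counts of the remaining nodes unchanged; with `max_degree=2` some triangles are lost, which is why that setting only gives approximate rankings.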

In Figure 9.7a, we show computation time for Node-Iterator (green), and for Eigen-Triangle
using 2 (red), 10 (blue), and 30 (black) eigenvalues versus graph size in terms of number of edges
for Enron ($E \approx 180K$). Solid (-), dashed (--), and dotted (...) lines are for no pruning, after
pruning $d \le 1$, and $d \le 2$ nodes, respectively. We empirically note that computation time
grows linearly with increasing graph size and also reduces with pruning. (Experiments ran on
a Pentium class workstation, with 16GB of RAM, running Linux Fedora Core. To account for
possible variability due to system state, each run is repeated 10 times and execution time results
are averaged. Error bars show the variance across repeated runs.)

12 drudge.com/user/rcade
13 drudge.com/user/gzlives

Figure 9.7: (a) Computation time of computing egonet features vs. number of edges in Enron.
Effect of pruning degree-1 and degree-2 nodes on computation time of counting
triangles. Solid (-): no pruning, dashed (--): pruning $d \le 1$, and dotted (...): pruning
$d \le 2$ nodes. Computation time increases linearly with increasing number of edges,
whereas it decreases with pruning. (b) Time vs. accuracy. Effect of pruning on
accuracy of finding top anomalies as in the original ranking before pruning. New
rankings are scored using Normalized Discounted Cumulative Gain. Pruning reduces
time for both Node-Iterator and Eigen-Triangle for different number of eigenvalues
while keeping accuracy as high as ≈1 and ≈0.9, respectively.

Pruning low-degree nodes and using Eigen-Triangle reduce computation time; however, they
only provide approximate answers. In order to quantify their accuracy, we compare the rank
list of outliers returned by Node-Iterator to the rank list of outliers returned by each
approximate method. To measure how rankings changed with approximation compared to the original
rankings, we use Normalized Discounted Cumulative Gain (NDCG), which is widely used in
Information Retrieval for measuring the effectiveness of search engines. The premise of NDCG
is that highly ranked items in the original list appearing lower in the approximated list are
penalized logarithmically proportional to their positions in the approximated list.
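
A minimal sketch of one common NDCG formulation is shown below. The exact variant used in the experiments is not specified, so the choices here are assumptions: the relevance of a node is taken to be its outlierness score under the exact method, the discount is $\log_2(\text{position} + 1)$, and the name `ndcg_at_k` is hypothetical.

```python
import math


def ndcg_at_k(exact_scores, approx_ranking, k):
    """NDCG of an approximate ranking against the scores of the exact method.

    exact_scores:   {node: outlierness score from the exact method}
    approx_ranking: list of nodes, most anomalous first
    """
    def dcg(nodes):
        # Position 0 gets discount log2(2) = 1, position 1 gets log2(3), ...
        return sum(
            exact_scores.get(node, 0.0) / math.log2(pos + 2)
            for pos, node in enumerate(nodes[:k])
        )

    ideal_ranking = sorted(exact_scores, key=exact_scores.get, reverse=True)
    ideal_dcg = dcg(ideal_ranking)
    return dcg(approx_ranking) / ideal_dcg if ideal_dcg > 0 else 0.0
```

A ranking identical to the exact one scores 1; pushing highly scored nodes further down the approximate list lowers the value, with the penalty growing logarithmically in the position.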

Figure 9.7b shows time vs. NDCG scores for Eigen-Triangle using 2, 5, 10 and 30 eigenvalues,
and also Node-Iterator for top k anomalies. For brevity, we only show ranking scores for k = 100.
The *, +, and o symbols represent no pruning, pruning $d \le 1$, and $d \le 2$ nodes, respectively.
We notice that while reducing computation time, pruning low degree nodes as well as using
Eigen-Triangle keeps the accuracy as high as ≈0.9 for Eigen-Triangle(30), and above 0.55 for
Eigen-Triangle(2).
9.3 Summary of contributions


In this chapter, we introduced one of the few works that focus on anomaly detection in graph data,
including weighted graphs. Our major contributions are summarized as follows:

• Proposing to work with “egonets”, that is, the induced sub-graph of the node of interest and 
its neighbors; we give a small, carefully designed list of numerical features for egonets. 

• Discovery of new patterns that egonets follow, such as patterns in density (Obs.9.1.1:
EDPL), weights (Obs.9.1.2: EWPL), principal eigenvalues (Obs.9.1.3: ELWPL), and ranks
(Obs.9.1.4: ERPL). Proof of Lemma 5, linking the ERPL to the EWPL.

• OddBall 14, a fast, unsupervised method to detect abnormal nodes in weighted graphs.
Our method does not require any user-defined constants. It also assigns an “outlierness”
score to each node. Possible approximations in feature extraction provide speed-up,
keeping accuracy as high as ≈0.9.

• Experiments on real graphs of over 1 million nodes, where OddBall reveals nodes that 
indeed have strange or extreme behavior. 

Our future work will generalize OddBall to time-evolving graphs, where the challenge is to 
find patterns that neighborhood sub-graphs follow, track the behavior of nodes, and to extract 
features incrementally over time. 


14 Source code of OddBall: www.cs.cmu.edu/~lakoglu/#code
Chapter 10


Anomalies in Binary-attributed Graphs 


Problem Statement: Given a graph with node attributes, how can we find meaningful
patterns such as clusters of nodes and clusters of attributes, as well as bridges and outliers?


Real graphs often have nodes with attributes, in addition to connectivity information. For
example, social networks contain both the friendship relations as well as user attributes such as
interests and likes. Other examples include gene interaction networks with gene expression
information, and phone call networks with customer demographics. Both types of information can be
described by a graph in which nodes represent the objects, edges represent the relations between
them, and attribute vectors associated with the nodes represent their attributes. Such graph data
is often referred to as an attributed graph.

Given such an attributed graph how can we find meaningful patterns, clusters of nodes, clusters 
of attributes, and anomalies? For example, consider the case of YouTube in which graph nodes 
represent users and YouTube-group memberships represent attributes. Given the who-friends- 
whom and who-belongs-to-which YouTube groups information, how can we summarize and 
make sense out of it? 

In this chapter, we introduce PICS for mining attributed graphs. Specifically, PICS finds
cohesive clusters of nodes that have similar connectivity patterns and exhibit high attribute
homogeneity. PICS additionally groups the node attributes into attribute-clusters. These patterns further
help us to spot anomalies and bridges. A key characteristic of PICS is that it is parameter-free:
it can recover the number of necessary clusters without any user intervention. For example, our
algorithm automatically finds coherent user clusters and YouTube-group clusters as shown in
Figure 10.1. As a first observation, the algorithm automatically discovers clusters of users who
like the ‘anime’ genre and form a so-called ‘core and periphery’ pattern: the core consists of the
users in the bottom-right dark square of the blue square labeled as ‘2’; also notice that the anime
fans overwhelmingly belong to a coherent set of YouTube-groups (blue square labeled as ‘A2’).
As we show in Table 10.2, those groups include anime4ever, narutoholics, and crazyforanime.
More discussion on YouTube results is given in experiments, §10.2.2.
Figure 10.1: PICS on YOUTUBE finds clusters of users with similar connectivity and attribute
coherence. Left: the adjacency matrix (users-to-users); right: the attribute matrix
(users-to-YouTubeGroups). Both matrices are carefully arranged by PICS, revealing
patterns: e.g., the anime fans are heavily connected, and they focus on the same
YouTube-groups. See §10.2.2 for more discussion.


In a nutshell, the contributions of this work are the following:

• Algorithm design: We introduce PICS, a novel clustering method to summarize graphs
with node attributes. In effect, it groups the nodes with similar connectivity patterns into
cohesive clusters that have high attribute homogeneity. Besides, it also clusters the
attributes into attribute-clusters.

• Automation: PICS does not require any user-specified input such as the number of clusters, 
similarity functions or any sort of thresholds. 

• Scalability: The run time of PICS grows linearly with total input graph and attribute size. 

• Effectiveness: We evaluate our method on a diverse collection of real data sets with tens 
of thousands of nodes and attributes. Our results show that PICS successfully recovers 
cohesive node clusters, reveals the bridge and outlier nodes, and groups attributes into 
meaningful clusters. 

Next we highlight the main challenges associated with simply extending traditional clustering 
algorithms to solve our problem, followed by method description and experiment results. 
Why not simple extensions?

E1 Represent the relations of the objects as additional attributes, and then perform flat clustering
on this extended feature space. The challenges with this approach are two-fold: First, it
leads to a very large number of features and thus is faced with the curse of dimensionality 
[Beyer et al., 1999]. Second, it yields two separate types of attributes where it is not clear 
how to weigh those different sets for clustering. 

E2 Represent the attributes of the objects as additional nodes in the graph, introduce new edges 
from the original nodes to the attribute nodes they exhibit, and then perform graph clus¬ 
tering on this extended graph. Again, there are two challenges with this approach: First, 
the graph grows with new nodes and edges, which may be quite numerous. Additionally, 
it is not clear how to do clustering in this heterogeneous graph which contains two types 
of nodes and edges. 

E3 Introduce edges between all pairs of nodes where edges are weighted by the existence 
of connectivity and attribute similarity: This approach requires quadratic computation of 
pair-wise similarities and is thus intractable for large graphs. It also requires a careful
choice of a similarity function. 

In summary, clustering attributed graphs by applying traditional clustering approaches presents 
several nontrivial challenges. 


10.1 PICS: Method Description 

10.1.1 Problem Description 

In this work, we address the problem of finding cohesive clusters in an attributed graph. Specif¬ 
ically, given a graph with n nodes and their binary connectivity information, where nodes are 
associated with f binary attributes (interchangeably, features), our goal is to group the nodes
into k disjoint clusters and the features into l disjoint clusters such that the nodes in the same cluster have
“similar connectivity” and also exhibit high “feature coherence” (interchangeably we also use 
the term feature similarity). Informally, a set of nodes have “similar connectivity” if the set of 
nodes in the graph they connect to “highly” overlap. Similarly, a set of nodes have high “feature 
coherence” if the set of features they exhibit “highly” overlap. 


Synthetic Graph Example 

To elaborate on the terminology, we give an example in Figure 10.2. PICS detects 5 node¬ 
clusters and 3 feature-clusters in this example graph. Notice that the nodes in the same cluster 
agree on their features to a high extent, i.e. have high feature coherence; for example nodes in 
node-cluster 1 exhibit features in feature-clusters 2 and 3, nodes in node-cluster 2 exhibit features 
in feature-clusters 1 and 2, and so on. In addition, notice the nodes in the same cluster having 
similar connections among themselves as well as to the rest of the graph. For example, nodes in 
node-cluster 2 (similarly node-cluster 3) are densely connected among themselves but scarcely
to the rest, nodes in node-cluster 4 are densely connected to nodes in node-cluster 5 (and vice 
versa) and scarcely to the rest, and so on. 

Note that the nodes in a cluster that PICS finds may not necessarily be densely connected among 
themselves. In fact, the nodes in node-cluster 1 in the example graph are not connected to each 
other at all! They, however, share the same set of features and still have similar connectivity to 
the graph. Simply put, they are “familiar strangers”. Similarly, nodes in node-clusters 4 and 5 
are not connected within the clusters but across each other, forming a “bipartite core”. 

We would like to point out that while the traditional graph clustering algorithms would recover 
node-clusters 2 and 3, PICS can find additional types of clusters such as familiar strangers and
bipartite cores, providing a richer analysis of a given graph. This derives from our more general 
cluster definition of nodes with “similar”, rather than “dense”, connectivity. 
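
To make the distinction between "similar" and "dense" connectivity concrete, the following minimal sketch builds a toy graph and compares neighbor sets. The Jaccard overlap and the toy edge list are illustrative assumptions only: PICS itself uses no similarity function, and this measure is introduced here purely for intuition.

```python
def jaccard(a, b):
    """Jaccard overlap of two sets; defined as 1.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Toy undirected graph (hypothetical): nodes 0 and 1 are "familiar strangers"
# (no edge between them, identical neighbours); nodes 2, 3, 4 form a clique.
edges = [(0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (2, 4), (3, 4)]
neighbors = {v: set() for v in range(5)}
for u, v in edges:
    neighbors[u].add(v)
    neighbors[v].add(u)

# Nodes 0 and 1 connect to exactly the same nodes without being linked.
strangers_overlap = jaccard(neighbors[0], neighbors[1])

# Nodes 2 and 3 are linked; including each node in its own neighbourhood
# shows that they also reach the same set of nodes.
clique_overlap = jaccard(neighbors[2] | {2}, neighbors[3] | {3})
```

Both pairs have fully overlapping neighborhoods, yet only the second pair is densely connected, which is the sense in which "similar" connectivity is the more general notion.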



(a) Original A and F 


(b) PICS A and F 


Figure 10.2: PICS on a synthetic dataset with n=900 nodes and f=180 features. Notice 5 node
and 3 feature clusters. Nodes in the same cluster exhibit high feature homogeneity
and have similar connectivity patterns. Note that similar connectivity includes, but
is not limited to, dense connectivity; while node-clusters 2 and 3 are densely
connected within the clusters, node-clusters 1, 4, and 5 are not. See §10.1.1 for
more details.


10.1.2 Problem Formulation 

The main questions that arise given the above problem definition are: How should we decide 
the number of node and feature clusters, i.e., k and l, respectively? How can we assign the 
nodes and the features to their “proper” clusters? How much overlap of the features is “high” 
enough? We address these questions without requiring users to set any parameters, such
as the number of clusters or feature similarity thresholds, or to choose from a large
collection of similarity functions. In fact, automation is one of the main contributions of
our approach.
Our starting point is data compression. Specifically, we want to compress two inter-related matrices
simultaneously. The first matrix is the n × n binary connectivity matrix A, and the second
is the n × f binary feature matrix F (note that although we consider binary graphs in our work,
similar ideas can also be applied to weighted graphs). Our goal then is to compress (conceptually, 
summarize) these matrices simultaneously by looking at partitions/clusters/groups (i.e. homo¬ 
geneous, rectangular “blocks”) of both low and high densities. The reason for operating on the 
matrices simultaneously is simply that the two matrices are inter-related: the node groups should 
be homogeneous in both the connectivity matrix A as well as in the feature matrix F. 

The natural question is, how many rectangular blocks should we have in each matrix A and F? 
To compress these matrices efficiently, we need to have several highly homogeneous blocks. On 
the other hand, having more clusters allows us to obtain more homogeneous blocks (at the very 
extreme, we can have n × n + n × f blocks, each having perfect homogeneity of 0 or 1). The best
compression model should achieve a proper trade-off between these two factors of homogeneity 
(data description complexity) and the number of blocks (model description complexity). To 
achieve this goal we use the MDL principle [Grünwald, 2007], a model selection criterion based
on lossless compression principles: we design a cost criterion that we aim to minimize, in which
costs are based on the number of bits required to transmit both the “summary” of the structure 
(model) as well as each rectangular block (data) within the structure. 

In our formulation, we address the hard-clustering problem in which each node (and each at¬ 
tribute) is assigned to a single cluster. An interesting research direction is to generalize our 
formulation for soft-clustering, where nodes could belong to multiple node-clusters. 


Notation 

Let k denote the desired number of disjoint node-clusters and let l denote the desired number 
of disjoint feature-clusters. Let $R : \{1, 2, \ldots, n\} \to \{1, 2, \ldots, k\}$ and $C : \{1, 2, \ldots, f\} \to
\{1, 2, \ldots, l\}$ denote the assignments of nodes to node-clusters and features to feature-clusters,
respectively. We will refer to (R, C) as a mapping. To better understand a mapping, given the
node-clusters and feature-clusters, let us rearrange the rows and columns of the connectivity 
matrix A such that all rows corresponding to node-cluster-1 are listed first, followed by rows in 
node-cluster-2, and so on. We also rearrange the columns in the same fashion. Note that the row 
and column arrangements for A will be the same. One can imagine that such a rearrangement 
sub-divides the matrix A into $k^2$ two-dimensional, rectangular blocks, which we will refer to as
$B^A_{ij}$, $i, j = 1, \ldots, k$.

Similarly, we can rearrange the rows of the feature matrix F such that all rows corresponding 
to node-cluster-1 are listed first, followed by rows in node-cluster-2, and so on. One can realize
that the row arrangements of A and F matrices will be the same. In a manner different than for 
A, we will rearrange the columns of the feature matrix F such that all columns corresponding to 
feature-cluster-1 are listed first, followed by columns in feature-cluster-2, and so on. As a result, 
we will obtain rectangular blocks denoted as $B^F_{ij}$, $i = 1, \ldots, k$ and $j = 1, \ldots, l$ for F. Finally, let
the dimensions of $B^F_{ij}$ and $B^A_{ij}$ be $(r_i, c_j)$ and $(r_i, r_j)$, respectively.
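
The rearrangement into blocks never has to be carried out physically: it is enough to accumulate, for every pair of groups, how many cells and how many 1s fall into the corresponding block. The sketch below illustrates this under stated assumptions: matrices are plain lists of 0/1 rows, cluster labels are 0-based, and the helper name `block_counts` is hypothetical rather than part of the PICS source code.

```python
def block_counts(M, row_groups, col_groups, k, l):
    """Return (ones, size): per-block number of 1s and number of cells.

    M is a list of 0/1 rows; row_groups[x] and col_groups[y] give the
    0-based cluster of row x and column y.
    """
    ones = [[0] * l for _ in range(k)]
    size = [[0] * l for _ in range(k)]
    for x, row in enumerate(M):
        for y, cell in enumerate(row):
            i, j = row_groups[x], col_groups[y]
            ones[i][j] += cell
            size[i][j] += 1
    return ones, size

# Toy feature matrix (hypothetical): 3 nodes, 3 features.
F = [[1, 1, 0],
     [1, 1, 0],
     [0, 0, 1]]
R = [0, 0, 1]   # node -> node-cluster
C = [0, 0, 1]   # feature -> feature-cluster
ones_F, size_F = block_counts(F, R, C, k=2, l=2)
```

For the connectivity matrix the same helper would be called with the node assignment on both axes, `block_counts(A, R, R, k, k)`, reflecting that rows and columns of A share one arrangement.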
10.1.3 Objective Function Formulation


We now describe a two-part cost criterion for the (lossless) compression of the connectivity and 
feature matrices A and F. The compression cost can be thought of as the total number of bits 
required to transmit these matrices over a network channel. The first part is the model description 
cost that consists of describing the mapping (R, C). The second part is the data description cost
that consists of encoding the sub-matrices (i.e., the “blocks” $B_{ij}$), given the mapping. Intuitively,
a good choice of (R, C) would simultaneously compress A and F well, and as a result yield a
low total description cost. 

Next, we describe those two parts in more detail and then give the total encoding cost (i.e. our 
objective function). 


Model Description Cost 

This part consists of encoding the number of node and feature clusters as well as the correspond¬ 
ing mapping. 

• The number of nodes n and the number of features f (i.e., matrix dimensions) require
log* n + log* f bits, where log* is the universal code length for integers [Rissanen, 1983].
This term is independent of any particular mapping.

• The number of node and feature clusters (k, l) require log* k + log* l bits.

• The node and feature cluster assignments with arithmetic coding require nH(P) + fH(Q)
bits, where H denotes the Shannon entropy function, P is a multinomial random variable
with the probability $p_i = r_i/n$ and $r_i$ is the size of the i-th node cluster, $1 \le i \le k$. Similarly,
Q is another multinomial random variable with the probability $q_j = c_j/f$ and $c_j$ is the size of
the j-th feature cluster, $1 \le j \le l$.
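
The two ingredients of the model description cost can be sketched in a few lines. This is a minimal illustration, assuming the common form of Rissanen's universal code, log* n = log2(c0) + log2 n + log2 log2 n + ... (positive terms only) with c0 ≈ 2.865064; the function names are hypothetical.

```python
import math

def log_star(n):
    """Universal code length in bits for an integer n >= 1."""
    bits = math.log2(2.865064)   # normalising constant c_0
    term = math.log2(n)
    while term > 0:              # add log2 n, log2 log2 n, ... while positive
        bits += term
        term = math.log2(term)
    return bits

def assignment_bits(sizes):
    """n * H(P) for cluster sizes r_1..r_k, i.e. -sum_i r_i * log2(r_i / n)."""
    total = sum(sizes)
    return -sum(s * math.log2(s / total) for s in sizes if s > 0)

# Four nodes in two equal clusters cost one bit per node; a single cluster
# costs nothing because the assignment carries no information.
balanced = assignment_bits([2, 2])
trivial = assignment_bits([4])
```

The assignment term grows with the number and balance of clusters, which is what penalises overly fine-grained mappings.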


Data Description Cost 

This part consists of encoding the matrix blocks. 

• For each block $B^A_{ij}$, $i, j = 1, \ldots, k$, and $B^F_{ij}$, $i = 1, \ldots, k$, $j = 1, \ldots, l$, $n_1(B_{ij})$, that is the
number of 1s in the sub-matrix, requires $\log^* n_1(B_{ij})$ bits.

• Having encoded the summary information about the rectangular blocks, we next encode
the actual blocks $B_{ij}$. We can calculate the density $P_{ij}(1)$ of 1s in $B_{ij}$ using the description
code above as $P_{ij}(1) = n_1(B_{ij})/n(B_{ij})$, where $n(B_{ij}) = n_1(B_{ij}) + n_0(B_{ij}) = r_i c_j$ for
F blocks ($r_i r_j$ for A blocks), and $n_1(B_{ij})$ and $n_0(B_{ij})$ are the number of 1s and 0s in
$B_{ij}$, respectively. Then, the number of bits required to encode each block using arithmetic
coding is

$$E(B_{ij}) = -n_1(B_{ij}) \log_2 P_{ij}(1) - n_0(B_{ij}) \log_2 P_{ij}(0) = n(B_{ij})\, H(P_{ij}(1)).$$
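
As a small worked example of the block cost, the sketch below evaluates n(B) · H(P(1)) for a few block densities. The function names are hypothetical; the convention that a block with density 0 or 1 costs zero bits follows from the usual limit 0 · log 0 = 0.

```python
import math

def binary_entropy(p):
    """Shannon entropy in bits of a Bernoulli(p) source."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def block_cost(n1, n):
    """E(B) = n(B) * H(P(1)) bits for a block with n cells, n1 of them 1s."""
    return n * binary_entropy(n1 / n) if n > 0 else 0.0

empty = block_cost(0, 100)     # perfectly homogeneous block of 0s
mixed = block_cost(50, 100)    # maximally mixed block: one bit per cell
sparse = block_cost(10, 100)   # mostly 0s: cheaper than the mixed block
dense = block_cost(90, 100)    # mostly 1s: same cost as the sparse block
```

Blocks of very low and very high density are equally cheap, which is why the formulation looks for homogeneous blocks of both kinds rather than for dense blocks only.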
Total Encoding Cost (Length in bits)

We can now write the total encoding cost of the connectivity and feature matrices A and F, with 
respect to a particular mapping (R, C ): 

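$$
\begin{aligned}
L(A, F; R, C) = {} & \log^* n + \log^* f + \log^* k + \log^* l - \sum_{i=1}^{k} r_i \log_2\!\left(\frac{r_i}{n}\right) - \sum_{j=1}^{l} c_j \log_2\!\left(\frac{c_j}{f}\right) \\
& + \sum_{i=1}^{k} \sum_{j=1}^{k} \left( \log^* n_1(B^A_{ij}) + E(B^A_{ij}) \right) + \sum_{i=1}^{k} \sum_{j=1}^{l} \left( \log^* n_1(B^F_{ij}) + E(B^F_{ij}) \right).
\end{aligned}
$$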
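
The total cost can be evaluated directly for any candidate mapping, which makes it easy to check on a toy input that a sensible clustering compresses better than the trivial one. The following is a sketch under stated assumptions, not the released PICS code: matrices are lists of 0/1 rows, labels are 0-based, all function names are hypothetical, and the count of 1s is encoded as log*(n_1 + 1) so that empty blocks remain encodable.

```python
import math

def log_star(n):
    """Universal code length in bits for an integer n >= 1."""
    bits, term = math.log2(2.865064), math.log2(n)
    while term > 0:
        bits += term
        term = math.log2(term)
    return bits

def entropy_bits(n1, n):
    """n * H(n1 / n): arithmetic-coding cost of a block with n cells."""
    if n == 0 or n1 == 0 or n1 == n:
        return 0.0
    p = n1 / n
    return -n1 * math.log2(p) - (n - n1) * math.log2(1 - p)

def blocks_cost(M, row_of, col_of, k, l):
    """Data description cost of all k x l blocks of matrix M."""
    ones = [[0] * l for _ in range(k)]
    size = [[0] * l for _ in range(k)]
    for x, row in enumerate(M):
        for y, cell in enumerate(row):
            ones[row_of[x]][col_of[y]] += cell
            size[row_of[x]][col_of[y]] += 1
    # Assumption: log*(n1 + 1) instead of log*(n1), to allow n1 = 0.
    return sum(log_star(ones[i][j] + 1) + entropy_bits(ones[i][j], size[i][j])
               for i in range(k) for j in range(l))

def total_cost(A, F, R, C, k, l):
    """L(A, F; R, C): model cost plus data cost, in bits."""
    n, f = len(A), len(F[0])
    cost = log_star(n) + log_star(f) + log_star(k) + log_star(l)
    for groups, count, total in ((R, k, n), (C, l, f)):
        for g in range(count):
            size = groups.count(g)
            if size:
                cost -= size * math.log2(size / total)
    return cost + blocks_cost(A, R, R, k, k) + blocks_cost(F, R, C, k, l)

# Toy input (hypothetical): two groups of 3 nodes, each fully connected
# internally and each exhibiting its own pair of features.
A = [[1 if (x < 3) == (y < 3) else 0 for y in range(6)] for x in range(6)]
F = [[1 if (x < 3) == (y < 2) else 0 for y in range(4)] for x in range(6)]

one_cluster = total_cost(A, F, [0] * 6, [0] * 4, 1, 1)
two_clusters = total_cost(A, F, [0, 0, 0, 1, 1, 1], [0, 0, 1, 1], 2, 2)
```

On this input the two-cluster mapping pays more for the model but makes every block perfectly homogeneous, so its total is lower than that of the single-cluster mapping.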

10.1.4 Proposed Algorithm 

The total encoding cost L(A, F;R, C) can point out the best model with the minimum cost 
among many. It does not, however, tell us how to find the best model. [Johnson et al., 2004] 
show that column-reordering for a single matrix in order to reduce the total run length of the 
matrix is NP-hard, with a reduction from the Hamiltonian path problem. Our objective function 
is different but related, in that we also try to reorder rows and columns to optimize block encoding.
Thus, we conjecture that our problem is also hard. As a result, we resort to a monotonic
iterative heuristic solution. Our experiments show that the proposed heuristic algorithm PICS 
performs quite well in practice for the real data sets used. The pseudo-code 1 of PICS is given in 
Algorithm 1. 

Algorithm 1: PICS Algorithm

Input: n × n link matrix A, n × f feature matrix F
Output: A heuristic solution towards minimizing the total encoding cost L(A, F; R, C): number
        of row and column groups $(k^*, l^*)$, associated mapping $(R^*, C^*)$

1  Set $k^0 = l^0 = 1$ as we start with a single node and feature cluster.
2  Set $R^0 := \{1, 2, \ldots, n\} \to \{1, 1, \ldots, 1\}$
3  Set $C^0 := \{1, 2, \ldots, f\} \to \{1, 1, \ldots, 1\}$
4  Let T denote the outer iteration index. Set T = 0.
5  while not converged do
6      $C^{T+1}, l^{T+1}$ := Split-FeatureGroup (F, $C^T$, $l^T$)
7      $(R^{T+1}, C^{T+1})$ := Shuffle (A, F, $(R^T, C^{T+1})$, $(k^T, l^{T+1})$)
8      $R^{T+1}, k^{T+1}$ := Split-NodeGroup (A, F, $(R^{T+1}, C^{T+1})$, $(k^T, l^{T+1})$)
9      $(R^{T+1}, C^{T+1})$ := Shuffle (A, F, $(R^{T+1}, C^{T+1})$, $(k^{T+1}, l^{T+1})$)
10     if $L(A, F; R^{T+1}, C^{T+1}) \ge L(A, F; R^T, C^T)$ then
11         return $(k^*, l^*) = (k^T, l^T)$, $(R^*, C^*) = (R^T, C^T)$
12     else Set T = T + 1

1 Source code of PICS: www.cs.cmu.edu/~lakoglu/#code
PICS starts with a single node and a single feature cluster (Lines 1-3) and iterates between two
steps. At each iteration, it first tries to increase the number of feature clusters l by 1, by splitting 
the feature cluster with the maximum entropy per feature into two clusters (Line 6). Then, it 
shuffles the rows and columns of A and F such that the new ordering (mapping) yields a lower 
total encoding cost for the current number of clusters (k, l ) (Line 7). Next, it tries to increase the 
number of node clusters k by 1, followed by another shuffle step (Lines 8 and 9, respectively). 
The algorithm halts when the total cost cannot be reduced any further.

The implementation details, together with further explanation of the procedures Shuffle,
Split-FeatureGroup, and Split-NodeGroup, follow.


‘Split’ procedures 

Simply put, Split-FeatureGroup (similarly Split-NodeGroup) increases the number 
of attribute (node) groups by 1 by first finding the attribute (node) group with the maximum 
entropy per-column (row) (Line 1), and then moving those attributes (nodes) in that group whose
removal reduces the per-column (row) entropy to the new group (Lines 2-6). If no column (row)
can be moved, the procedures return the original mapping (Lines 7-8). 


Procedure Split-FeatureGroup(F, $C^T$, $l^T$)

Input: n × f feature matrix F, $C^T$, $l^T$
Output: $C^{T+1}$, $l^{T+1}$

1 Split the column group g with the maximum entropy per-column using Equation (10.1)

$$g := \arg\max_{1 \le j \le l} \frac{1}{c_j} \sum_{i=1}^{k} n(B^F_{ij})\, H(P^F_{ij}(1)) \qquad (10.1)$$

2 for each column y in column group g do
3     if removal of y from g decreases the per-column entropy of g as in Equ. (10.2)

$$\frac{1}{c_g - 1} \sum_{i=1}^{k} n({B'}^{F}_{ig})\, H({P'}^{F}_{ig}(1)) < \frac{1}{c_g} \sum_{i=1}^{k} n(B^F_{ig})\, H(P^F_{ig}(1)) \qquad (10.2)$$

      where ${B'}^{F}_{ig}$ denotes $B^F_{ig}$ without column y, then
4         Place y into the new group: $C_y^{T+1} = l^T + 1$
5         Update $B^F_{ig}$ $\forall i$, $1 \le i \le k$.
6     else $C_y^{T+1} = C_y^T = g$
7 if size of new feature-group > 0 then $l^{T+1} = l^T + 1$
8 else $l^{T+1} = l^T$
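
The split step for feature groups can be sketched as follows. This is an illustrative reading of the procedure, not the released implementation: labels are 0-based, F is a list of 0/1 rows, the per-column entropy is taken as the summed block cost divided by the number of columns in the group, and the guard that keeps the original group non-empty is an added assumption.

```python
import math

def entropy_bits(n1, n):
    """n * H(n1 / n) in bits; zero for empty or homogeneous blocks."""
    if n == 0 or n1 == 0 or n1 == n:
        return 0.0
    p = n1 / n
    return -n1 * math.log2(p) - (n - n1) * math.log2(1 - p)

def split_feature_group(F, R, C, k, l):
    """One Split-FeatureGroup step; returns (new C, new l)."""
    r = [R.count(i) for i in range(k)]           # node-group sizes
    c = [C.count(j) for j in range(l)]           # feature-group sizes
    col_ones = [[0] * k for _ in F[0]]           # 1s of column y per node group
    for x, row in enumerate(F):
        for y, cell in enumerate(row):
            col_ones[y][R[x]] += cell
    ones = [[sum(col_ones[y][i] for y in range(len(C)) if C[y] == j)
             for i in range(k)] for j in range(l)]

    def per_column(n1_by_group, width):
        return sum(entropy_bits(n1_by_group[i], r[i] * width)
                   for i in range(k)) / width

    # Line 1: the group with the maximum entropy per column.
    g = max((j for j in range(l) if c[j] > 0),
            key=lambda j: per_column(ones[j], c[j]))
    new_C, cur, width = list(C), list(ones[g]), c[g]
    # Lines 2-6: move each column whose removal lowers the per-column entropy.
    for y in [y for y in range(len(C)) if C[y] == g]:
        if width == 1:
            break                                # assumption: keep g non-empty
        without = [cur[i] - col_ones[y][i] for i in range(k)]
        if per_column(without, width - 1) < per_column(cur, width):
            new_C[y], cur, width = l, without, width - 1
    # Lines 7-8: the number of groups grows only if something was moved.
    return new_C, (l + 1 if new_C.count(l) else l)

# Toy input (hypothetical): nodes 0-2 exhibit features 0-1, nodes 3-5
# exhibit features 2-3; all features start in one group.
F = [[1 if (x < 3) == (y < 2) else 0 for y in range(4)] for x in range(6)]
new_C, new_l = split_feature_group(F, [0, 0, 0, 1, 1, 1], [0, 0, 0, 0], 2, 1)
```

In this example the first two features are moved to the new group and the remaining two stay, so the split recovers the two feature clusters. Split-NodeGroup would follow the same pattern with rows in place of columns and with the A blocks included in the entropy.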
Procedure Split-NodeGroup(A, F, $R^T$, $C^T$, $k^T$, $l^T$)

Input: n × n connectivity matrix A, n × f feature matrix F, $(R^T, C^T)$, $(k^T, l^T)$
Output: $R^{T+1}$, $k^{T+1}$

1 Split the row group g with the maximum entropy per-row using Equation (10.3)
2 for each row x in row group g do
3     if removal of x from g decreases the per-row entropy of g as in Equ. (10.4) then
4         Place x into the new group: $R_x^{T+1} = k^T + 1$
5         Update $B^F_{gj}$ $\forall j$, $1 \le j \le l$ as well as $B^A_{gj}$ and $B^A_{jg}$ $\forall j$, $1 \le j \le k$.
6     else $R_x^{T+1} = R_x^T = g$
7 if size of new node-group > 0 then $k^{T+1} = k^T + 1$
8 else $k^{T+1} = k^T$


‘Shuffle’ procedure 

When the number of node or attribute groups changes, Shuffle finds a lower-cost mapping by 
reassigning the nodes and attributes to the existing groups. It does so by first iterating over all the 
rows (nodes) (Lines 4-13) and then the columns (attributes) (Lines 14-20), assigning each row 
(column) to the node (attribute) group that minimizes its encoding cost (Lines 12 and 19, resp.). 
It repeats the same process until the total cost cannot be reduced any further (Lines 21-22). 
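
The column half of Shuffle can be sketched as below. Because Equation (10.6) is not reproduced in this text, the per-column cost used here is an assumption made by analogy with the F-part of the row cost: the number of bits needed to encode the column's 1s and 0s in each node group under the current block densities. Labels are 0-based and all names are hypothetical.

```python
import math

def reassign_columns(F, R, C, k, l):
    """One pass over all columns of F; returns the new feature assignment."""
    r = [R.count(i) for i in range(k)]            # node-group sizes
    c = [C.count(j) for j in range(l)]            # feature-group sizes
    n_cols = len(C)
    col_ones = [[0] * k for _ in range(n_cols)]   # 1s of column y per node group
    for x, row in enumerate(F):
        for y, cell in enumerate(row):
            col_ones[y][R[x]] += cell

    # Block densities P_ij(1); they stay fixed during the pass.
    dens = [[0.0] * l for _ in range(k)]
    for i in range(k):
        for j in range(l):
            ones = sum(col_ones[y][i] for y in range(n_cols) if C[y] == j)
            cells = r[i] * c[j]
            dens[i][j] = ones / cells if cells else 0.0

    def code_len(count, p):
        """Bits to encode `count` symbols that each have probability p."""
        if count == 0:
            return 0.0
        return math.inf if p <= 0.0 else -count * math.log2(p)

    def cost(y, j):
        return sum(code_len(col_ones[y][i], dens[i][j]) +
                   code_len(r[i] - col_ones[y][i], 1.0 - dens[i][j])
                   for i in range(k))

    return [min(range(l), key=lambda j: cost(y, j)) for y in range(n_cols)]

# Toy input (hypothetical): feature 3 starts in the wrong group.
F = [[1 if (x < 3) == (y < 2) else 0 for y in range(4)] for x in range(6)]
shuffled = reassign_columns(F, [0, 0, 0, 1, 1, 1], [0, 0, 1, 0], 2, 2)
```

Here the misplaced feature joins the group whose densities explain it at zero cost. The row half would work the same way but would add the A-part of the cost for the row and its corresponding column.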


Equations 

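$$g := \arg\max_{1 \le i \le k} \frac{1}{r_i} \left( \sum_{j=1}^{l} n(B^F_{ij})\, H(P^F_{ij}(1)) + \sum_{j=1}^{k} n(B^A_{ij})\, H(P^A_{ij}(1)) + \sum_{j=1,\, j \ne i}^{k} n(B^A_{ji})\, H(P^A_{ji}(1)) \right) \qquad (10.3)$$

$$
\begin{aligned}
& \frac{1}{r_g - 1} \left( \sum_{j=1}^{l} n({B'}^{F}_{gj})\, H({P'}^{F}_{gj}(1)) + \sum_{j=1}^{k} n({B'}^{A}_{gj})\, H({P'}^{A}_{gj}(1)) + \sum_{j=1,\, j \ne g}^{k} n({B'}^{A}_{jg})\, H({P'}^{A}_{jg}(1)) \right) \\
& \quad < \frac{1}{r_g} \left( \sum_{j=1}^{l} n(B^F_{gj})\, H(P^F_{gj}(1)) + \sum_{j=1}^{k} n(B^A_{gj})\, H(P^A_{gj}(1)) + \sum_{j=1,\, j \ne g}^{k} n(B^A_{jg})\, H(P^A_{jg}(1)) \right) \qquad (10.4)
\end{aligned}
$$

where $B'_{gj}$ denotes the $B_{gj}$ without row x.

$$
\begin{aligned}
R_x^{t+1} := \arg\min_{1 \le i \le k} \Bigg\{ & \sum_{j=1}^{l} \sum_{u=0}^{1} n^F_u(x^j) \log \frac{1}{P^F_{ij}(u)} \\
& + \sum_{j=1}^{k} \sum_{u=0}^{1} \left( n^A_u(x_r^j) \log \frac{1}{P^A_{ij}(u)} + n^A_u(x_c^j) \log \frac{1}{P^A_{ji}(u)} \right) \\
& + d_{xx} \left( \log P^A_{i R_x}(1) + \log P^A_{R_x i}(1) - \log P^A_{ii}(1) \right) \\
& + (1 - d_{xx}) \left( \log P^A_{i R_x}(0) + \log P^A_{R_x i}(0) - \log P^A_{ii}(0) \right) \Bigg\} \qquad (10.5)
\end{aligned}
$$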
Procedure Shuffle(A, F, $R^T$, $C^T$, $k^T$, $l^T$)

Input: n × n connectivity matrix A, n × f feature matrix F, $(R^T, C^T)$, $(k^T, l^T)$
Output: $(R^{T+1}, C^{T+1})$

1  Let t denote the inner iteration index. Set t = T
2  Compute $B^t_{ij}$ and $P^t_{ij}$ with respect to $(R^t, C^t, k^t, l^t)$
3  while not converged do
4      Shuffle rows: Fix column assignments $C^t$
5      for each row x do
6          Splice x in F into $l^t$ parts (each corresponding to one of the column groups in F).
           Denote them as $x^1, \ldots, x^{l^t}$.
7          for each of these parts do
8              Compute the number of 1s and 0s, that is, $n^F_u(x^j)$, u = 0, 1 and $j = 1, \ldots, l^t$
9          Splice x in A and the corresponding column in A into $k^t$ parts. Denote them as
           $x_r^1, \ldots, x_r^{k^t}$ and $x_c^1, \ldots, x_c^{k^t}$, respectively.
10         for each of these parts do
11             Compute the number of 1s and 0s, that is, $n^A_u(x_r^j)$ and $n^A_u(x_c^j)$, u = 0, 1 and
               $j = 1, \ldots, k^t$
12     Assign each row (i.e., node) x into the node group $R_x^{t+1}$ that yields the minimum
       encoding cost for x using Equation (10.5)
13     Re-compute $B^{t+1}_{ij}$ and $P^{t+1}_{ij}$ with respect to $(R^{t+1}, C^t, k^t, l^t)$
14     Shuffle columns: Fix row assignments $R^{t+1}$ in A and F (also fixes the column
       assignment in A, so we operate only on F here)
15     for each column y do
16         Splice y in F into $k^t$ parts (each corresponding to one of the row groups in F).
           Denote them as $y^1, \ldots, y^{k^t}$.
17         for each of these parts do
18             Compute the number of 1s and 0s, that is, $n^F_u(y^i)$, u = 0, 1 and $i = 1, \ldots, k^t$
19     Assign each column (i.e., feature) y to the feature group $C_y^{t+1}$ that yields the minimum
       encoding cost for y using Equation (10.6)
20     Recompute $B^{t+1}_{ij}$ and $P^{t+1}_{ij}$ w.r.t. $(R^{t+1}, C^{t+1}, k^t, l^t)$
21     if there is no decrease in total cost then return $(R^t, C^t)$
22     else Set t = t + 1


In Equation (10.5), the first two lines respectively denote the cost of shifting row x in F, and the same row x and
its corresponding column in A, to a new group i. The last two lines account for the double-counting
of cell $d_{xx}$ in A.
Figure 10.3: Step-by-step operation of PICS on the synthetic attributed graph in Figure 10.2.


Notice that the PICS algorithm employs a top-down approach in finding the clusterings, where 
we start with a single node and a single attribute cluster and split into more clusters over iterations.
In light of this, an interesting future direction is to study a bottom-up approach: starting
with n node and f attribute clusters and merging into fewer clusters over iterations.

Synthetic Graph Example 

In Figure 10.3, we show the step-by-step operation of PICS on the synthetic dataset in 
Fig. 10.2. In Fig. 10.3(a), we see the output A and F after Split-FeatureGroup and 
Shuffle (Steps 6 and 7) are executed on the original A and F shown in Fig. 10.2(a). Here, 
Split-FeatureGroup increases the number of feature groups to 2, and then Shuffle re¬ 
orders the rows and columns in both matrices. Next, Split-NodeGroup increases the number 
of row groups to 2 (Step 8) and Shuffle reorders the rows and columns to yield a lower encoding
cost (Step 9). This is also visually clear in Fig. 10.3(b). PICS repeats the same steps in
Fig. 10.3(c) and (d). Notice that Split-FeatureGroup cannot increase the number of fea¬ 
ture groups above 3, and thus Shuffle is called only after Split-NodeGroup in Fig. 10.3(e) 
and (f), after which Split-NodeGroup also stops finding new node groups for reduced cost 
and the algorithm converges. 

Convergence 

The stopping criterion for Shuffle is satisfied if and only if the total encoding cost cannot be 
reduced any further by the new ordering. Therefore, Lines 7 and 9 in Algorithm 1 decrease the
objective criterion L(A, F; R, C). Since the objective criterion, i.e. the total encoding cost in
bits, has the lower bound zero and the number of node and feature clusters have respective upper
bounds n and f, the algorithm is guaranteed to converge.
Computational Complexity


The computationally most demanding component of PICS is the Shuffle procedure, which 
takes as input the A and F matrices and the number of clusters (k, l), and finds a new ordering
of the rows and columns that gives a lower encoding cost. It achieves this goal by iterating 
between two steps: (1) shuffling the columns in F, and (2) shuffling the rows in A and F, 
simultaneously. 

One iteration of step (1) above is $O(n_1(F) \cdot l)$, where $n_1(F)$ denotes the number of non-zeros
in F, as we access each column and compute its number of non-zeros and consider l possible
feature groups to place it into. Similarly, one iteration of step (2) is $O((n_1(F) + 2n_1(A)) \cdot k)$.
Therefore, the total complexity of Shuffle is $O([2n_1(A)k + n_1(F)(k + l)] \cdot t)$, where t is the
total number of inner iterations for Shuffle.

PICS calls Shuffle two times at each outer iteration T (Lines 7 and 9 in Algorithm 1). T is
equal to $\max(k^*, l^*)$. Thus, the overall complexity of PICS is $O(\max(k^*, l^*) \cdot [2n_1(A)k^* +
n_1(F)(k^* + l^*)] \cdot t)$, where t denotes the maximum number of inner iterations. Note that
PICS scales linearly with respect to the total number of non-zeros in A and F, the total number
of Shuffle iterations t (typically $t \ll n_1(A) + n_1(F)$), and quadratically with respect to the
number of clusters $(k^*, l^*)$, where $k^*$ and $l^*$ are usually small.
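
The scaling behaviour stated above can be checked with a few lines of arithmetic. The sketch below simply evaluates the expression inside the O(·) bound; the function names and all input numbers are hypothetical and chosen only to illustrate the linear and quadratic dependencies.

```python
def shuffle_cost(nnz_A, nnz_F, k, l, t):
    """Operation count of Shuffle: [2 n1(A) k + n1(F) (k + l)] * t."""
    return (2 * nnz_A * k + nnz_F * (k + l)) * t

def pics_cost(nnz_A, nnz_F, k, l, t):
    """Operation count of PICS: max(k, l) outer iterations of Shuffle."""
    return max(k, l) * shuffle_cost(nnz_A, nnz_F, k, l, t)

base = pics_cost(nnz_A=10_000, nnz_F=5_000, k=5, l=3, t=4)
doubled_nnz = pics_cost(nnz_A=20_000, nnz_F=10_000, k=5, l=3, t=4)
doubled_clusters = pics_cost(nnz_A=10_000, nnz_F=5_000, k=10, l=6, t=4)
```

Doubling the number of non-zeros doubles the estimate, whereas doubling both cluster counts quadruples it.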


10.2 Experimental Results 

10.2.1 Datasets 


In our experiments we studied six real-world datasets from various domains which we describe 
next. A summary is given in Table 10.1. In each case we are able to use PICS to automatically 
discover interesting node and attribute structures, with further investigation providing explana¬ 
tions for their presence. 


Dataset     n       f       n_1(A) + n_1(F)   Connectivity and Attribute Description
YOUTUBE     77381   30087   994542            User friendships and group memberships
TWITTER     9654    10000   81770             User ‘mention’s and hashtag usages
CALL        94      7       391               Phone calls and affiliations
DEVICE      94      7       5233              Bluetooth device scans and affiliations
POLBOOKS    92      2       840               Book co-purchases and political inclinations
POLBLOGS    1490    2       20580             Blog citations and political inclinations

Table 10.1: Dataset summary. n: number of nodes, f: number of features, n_1(A) + n_1(F):
number of non-zeros (nnz), i.e. edges, in the connectivity matrix A plus the nnz in
the feature matrix F. See §10.2.1 for more details.
YOUTUBE data consists of users and YouTube-groups. The links represent user-user friendships
and node attributes are the group memberships of each user. The data is compiled by
[Mislove et al., 2007] in 2007.

TWITTER graph contains early-adopter users, with links representing the who-mentioned- 
whom interactions between them. The attributes are the hashtags (tokens starting with a #) that 
the users used in their messages (Tweets). 

The CALL and DEVICE graphs are constructed using the Reality Mining data sets provided by 
the MIT Media Lab [Eagle et al., 2007]. The Reality Mining project was conducted in 2004-05 
with 94 human subjects using mobile phones pre-installed with special software that recorded 
data. CALL is built using the call logs, and represents the phone-call interactions between the 
subjects. DEVICE is constructed using Bluetooth device scans; an edge from i to j indicates 
person i’s device discovered person j’s device within five meters proximity. Subjects included 67
students (1st year grad, grad (not 1st year), 1st year undergrad, and undergrad (not 1st year)), 
faculty, and staff working in the Media Lab, and 27 business students at the university’s Sloan 
School of Management. 

POLBOOKS is a graph of books about U.S. politics published during the 2004 presidential election
[http://www-personal.umich.edu/~mejn/netdata/]; an edge from i to j indicates
that book i was frequently co-purchased with book j by the same buyers. Similarly,
POLBLOGS is a directed graph of hyperlinks between web-blogs on U.S. politics, compiled 
by [Adamic and Glance, 2005] in 2005. In both graphs, the nodes (books or blogs) are attributed 
as being liberal or conservative. 

The YOUTUBE and POLBOOKS graphs are undirected and the rest are directed. 

10.2.2 Clusters, bridges, and outliers 

Next, since no ground truth for the “true” clustering exists for our real datasets, we provide an
anecdotal and visual study.


YOUTUBE Dataset: 

PICS finds node and feature clusters of various sizes in our largest dataset YOUTUBE, as 
shown in Figure 10.1, § 1. In Table 10.2, we show example YouTube-groups in major feature 
clusters; notice that these clusters can be human-labeled as ‘porn’, ‘music’, ‘anime’, and ‘special 
interest’. 

With respect to node clusters, PICS finds a very sparse group of ‘familiar strangers’ in cluster 
‘1’ with high feature coherence but scarce connections to the rest of the graph. These users 
belong to arbitrarily many and mostly the same YouTube-groups labeled as ‘A1’, yet are not well-connected
among themselves. Node clusters in the blue square labeled as ‘2’ exhibit dense
connectivity among each other as well as high feature homogeneity as seen in the ‘anime’ groups 
labeled as ‘A2’. The node clusters in the blue square labeled as ‘4’ mostly belong to YouTube
groups associated with ‘porn and music’ labeled as ‘A4’. Notice that the nodes in each of these
clusters have quite similar connectivity to the graph. The node cluster ‘3’ contains nodes with
connections across many clusters, behaving like ‘bridges’. They also mostly belong to the same
YouTube-groups, labeled as ‘A3’. Finally, the small node clusters constitute the outliers, with 
arbitrarily many connections across clusters. All in all, by using PICS we are able to understand 
and summarize the YOUTUBE graph in a completely unsupervised fashion. 


porn         music            anime           interest
hotmodels    poptastic        anime4ever      streetboarders
upskirt      raphiphop        narutoholics    modelcooks
lesbokiss    guitarsolos      crazyforanime   poetryandmusic
men4men      metallovers      animefreakl     mexicogrupero
gayestgay    classicalmusic   AnimeDaisuki    bodypainting
bootyshake   xtinaaguilera    SailorMoon      nflfans
sexysolo     heavymetal       Tsyukomi        chelseafcfans


Table 10.2: Examples of YouTube-groups in feature clusters found by PICS. See Figure 10.1. 


TWITTER Dataset: 

Our Twitter dataset consists of directed edges (i,j) between users if i directs a message to j at 
least three times. Attributes in our network are the hashtags used at least three times by a user. 
In Figure 10.4, PICS finds a group of ‘casual users’ in node cluster ‘1’ with few connections
and few hashtags used. Node cluster ‘2’ is the most prominent: this is a dense group
of users (mostly tech bloggers) all from Italy who extensively mention each other but do not 
frequently message other users. They also use common and distinctive hashtags in Italian such 
as ‘#terremoto’ and ‘#partigi’. The two clusters in blue square (labeled as ‘4’) are the so-called 
‘heavy-hitters’ who use overwhelmingly many different hashtags (labeled as ‘A4’). PICS also 
reveals a ‘core-periphery’ pattern within these users. The clusters in blue square labeled as ‘5’ 
form dense diagonal blocks with dense connections within. The smaller clusters also correspond 
to the so-called ‘bridges’, with many connections across clusters. Notice that the bridge-nodes 
are mostly mentioned by others but do not themselves mention others. Upon further inspection 
of users in this group, we discover Jeffrey Zeldman, a well known author, as well as Jack Dorsey 
and Ev Williams, the two founders of Twitter. We also observe that users in groups labeled as 
‘3’ and ‘5’ have never used the hashtags in the rightmost feature cluster of Figure 10.4. 
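
The thresholding rule stated at the start of this paragraph (an edge or attribute is kept only if it is observed at least three times) can be sketched as follows. The input format, a list of raw (sender, receiver) mentions and a list of raw (user, hashtag) usages, as well as the function name and the sample records, are assumptions made for illustration; the original preprocessing code is not shown in the text.

```python
from collections import Counter

def build_binary_relations(mentions, hashtag_uses, min_count=3):
    """Return (edges, attributes): pairs observed at least min_count times."""
    mention_counts = Counter(mentions)        # (sender, receiver) -> count
    hashtag_counts = Counter(hashtag_uses)    # (user, hashtag) -> count
    edges = {pair for pair, n in mention_counts.items() if n >= min_count}
    attributes = {pair for pair, n in hashtag_counts.items() if n >= min_count}
    return edges, attributes

# Hypothetical raw records: ann mentions bob three times, bob mentions ann
# only twice; ann uses the hashtag four times, bob once.
mentions = [("ann", "bob")] * 3 + [("bob", "ann")] * 2
hashtag_uses = [("ann", "#terremoto")] * 4 + [("bob", "#terremoto")]
edges, attributes = build_binary_relations(mentions, hashtag_uses)
```

The resulting edge set is directed, so a pair can appear in one direction only, which matches the directed connectivity matrix used for this dataset.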


CALL Dataset: 

PICS finds 3 major node clusters as shown in Figure 10.5. The first cluster corresponds to the 
group of casual subjects who make phone calls to only a few people. The next two clusters, 



Figure 10.4: PICS on TWITTER finds a tight group of users from Italy and reveals groups 
of casual users, heavy-hitters, and bridges. Left: user-mentions-user adjacency 
matrix; right: users-to-hashtags feature matrix. 


which are relatively densely connected, are a group of business students and a group of grad students,
respectively.

With respect to outliers, notice a cluster of size 1, an outlier subject who does not belong to any
of the clusters. This subject, whose affiliation is not given in the dataset, does not make any
outgoing calls but receives calls from almost everyone, and is presumably a call service center
on campus. Another outlier we can spot is a 1st year grad student who is in the same cluster
as the business students in cluster 2; he/she neither calls nor gets called by other grad students,
but does interact with some business students.

With respect to bridges, we notice one grad student in cluster 3 who is in mutual contact with 
two business students. Notice that none of the other subjects has phone call interactions with the 
business students. 


DEVICE Dataset: 

In Figure 10.6, we observe 3 major dense clusters; clusters 1, 3 and 5. These three clusters involve 
mostly the grad, business and undergrad students, respectively. The reason these clusters form
near-cliques, i.e. are almost fully connected, may be due to the Bluetooth scans occurring when 
these groups of students sit in the same classroom all within five meters. 

In this dataset, the sparser clusters are also of interest. For instance, cluster 2 is a group of 
subjects whose devices seem to report far fewer scan results. This might be due to powered-off
Figure 10.5: PICS on CALL. Notice 3 major node groups of casual users, business and grad
students, respectively as well as a group of size 1, receiving many calls, probably a 
call service center. 


devices or turned-off Bluetooth functionality. 

In cluster 3 of business students, we also notice a column consisting of almost all zeros. This 
corresponds to a subject whose device can scan other devices, but somehow it does not get 
detected by others. This may be due to a malfunctioning device or missing data. 


POLBOOKS Dataset: 


The POLBOOKS dataset consists of liberal and conservative books, which might be thought of as
two major clusters. In Figure 10.7, PICS gives more information about the cluster structure by
finding 4 node clusters. The denser clusters (clusters 2 and 4) correspond to the “core” conservative
and liberal books, respectively, which are often purchased together. Clusters 1 and 3 are then
the corresponding “peripheral” books. Table 10.3 gives a list of several books in each “core”.
Notice that the “core” books do seem to lie at the two extremes of the political spectrum.

In cluster 1, we also observe 3 bridging books, namely Bush at War, The Bushes: Portrait of a
Dynasty, and Rise of the Vulcans: The History of Bush’s War Cabinet, which are co-purchased
with some liberal as well as some conservative books. These books are human-labeled as conservative
although they seem to have more of a historical perspective. Notice that the bridge nodes reside 
in the “periphery” rather than in the “core”. 

Figure 10.6: PICS on DEVICE finds 3 major dense node groups of grad, business and undergrad
students, respectively, as well as anomalies, probably missing data.



Figure 10.7: PICS on POLBOOKS finds 4 node groups corresponding to “core” and “peripheral”
liberal and conservative books, as well as several bridge-books, with historical
content and not extremely liberal or conservative.


POLBLOGS Dataset: 

In Figure 10.8, we observe 7 major node clusters in POLBLOGS. The first and the largest cluster
contains liberal and conservative blogs which do not have many citations. Clusters 2-4 consist
of conservative and clusters 5-7 of liberal blogs. Here, PICS seems to also reveal the “core” and
“periphery” structure for the political blogs. In particular, cluster 3 is a core conservative group
with a fanatic follower group (cluster 4), and a less fanatic follower group (cluster 2) of other
conservative blogs, which often cite the blogs in cluster 3. Examples of “core” conservative blogs
include rightwingnews.com, georgewbush.com, and conservativeeyes.blogspot.com. Similarly,
cluster 6 is a core liberal group with cluster 7 being its more fanatic, and cluster 5 being its less
fanatic follower group of other liberal blogs, with many citations to cluster 6. The blogs in the
liberal “core” include talkleft.com, liberaloasis.com, and democrats.org/blog.


Liberal:
- Lies and the Lying Liars Who Tell Them: A Fair and Balanced Look at the Right
- Big Lies: The Right-Wing Propaganda Machine and How It Distorts the Truth
- The Lies of George W. Bush
- Dude, Where’s My Country?
- Worse Than Watergate: The Secret Presidency of George W. Bush
- Thieves in High Places: They’ve Stolen Our Country and It’s Time to Take It Back

Conservative:
- Persecution: How Liberals Are Waging War Against Christianity
- Deliver Us from Evil: Defeating Terrorism, Despotism, and Liberalism
- Tales from the Left Coast
- A National Party No More
- Bush Country: How George W. Bush Became the First Great Leader of 21st Century
- Losing Bin Laden: How Bill Clinton’s Failures Unleashed Global Terror

Table 10.3: Examples of “core” liberal and conservative books.




Figure 10.8: PICS on POLBLOGS. Notice the “core” conservative and liberal blogs (clusters 
3 and 6), each with two sets of “peripheral” groups with many citations. 
Figure 10.9: Run time of PICS versus the total number of non-zeros (nnz) in A and F (averaged
over 10 runs, bars depict ± one standard deviation). The numbers in parentheses
denote the number of node and feature clusters (k*, l*) found. Notice that the run
time grows linearly w.r.t. total nnz.


10.2.3 Scalability 

In § 22 we theoretically showed that the computational complexity of PICS is linear in the total
graph and attribute size. In this section, we also demonstrate the time complexity of PICS
experimentally. Figure 10.9 shows the running time with respect to increasing total size for several
datasets we studied (the total size is the total number of non-zeros (nnz) in the A and F matrices).
Notice that the run time grows linearly with respect to the total nnz. (Recall that the run time also
depends on the number of clusters found, hence the slight dip for YOUTUBE at total size 950K,
where fewer clusters (20 vs. 23) are found.) Experiments were performed on a 4-CPU Intel 3.0GHz
Xeon server with 16GB RAM. All code was written in Matlab.
10.3 Summary of contributions


In this chapter, we introduced PICS², a parameter-free method that finds cohesive clusters and
hence helps us spot outliers in attributed graphs. The contributions of our work include:

• Algorithm design: We introduce a novel clustering model; PICS finds groups of nodes
in an attributed graph with (1) similar connectivity, and (2) attribute homogeneity. By
definition, clusters with similar connectivity include but are not limited to dense clusters.
It also groups the node attributes into meaningful clusters. The nodes deviating from the
discovered patterns correspond to bridge-nodes with connections across clusters or outlier-nodes
that do not belong well to any cluster.

• Parameter-free nature: PICS is fully automatic. It works without any user-specified input, 
such as the number of clusters, choice of density or similarity functions and thresholds. To 
the best of our knowledge, no other work has been proposed to mine attributed graphs in a 
parameter-free fashion. 

• Scalability: The run time of the proposed algorithms grows linearly with respect to the 
total graph and attribute size. 

• Effectiveness: We show that PICS discovers quality clusters, bridges and outliers in diverse
real-world datasets including YouTube and Twitter.

There are a number of interesting future directions for PICS, such as extending the algorithms to 
time-evolving attributed graphs by dynamically updating the mapping, and to more general cases 
such as soft-clustering, where the node, and similarly attribute, clusters may overlap. 


² Source code of PICS: www.cs.cmu.edu/~lakoglu/#code
Chapter 11

Anomalies in Categoric-attributed Data


PROBLEM STATEMENT: Given a database with categorical attributes, how can we find meaningful
patterns that succinctly describe the data and spot outliers?


In this chapter, we address the problem of anomaly detection in multi-dimensional databases
using pattern-based compression. Compression-based techniques have been explored mostly in
communications theory, for reducing transmission cost and increasing throughput, as well as in the
field of databases, for reducing storage cost and increasing query performance. Here, we improve
over recent work that identified compression as a natural tool for spotting anomalies [Smets and
Vreeken, 2011]. Simply put, we define the norm by the patterns that compress the data well.
Then, any data point that cannot be compressed well is said not to comply with the norm, and
thus can be flagged as abnormal.

The heart of our method, which we call CompreX, is to use a collection of dictionaries to
encode a given database. We exploit correlations between the features in the database, grouping 
features with high information gain together, and build dictionaries (a.k.a. look-up tables or code 
tables [Vreeken et al., 2011]) for each group of strongly interacting features. 

Informally, these dictionaries capture the data distribution in terms of patterns; the more often a
pattern occurs, the shorter its encoded length. The goal is to find the optimal set of dictionaries,
i.e. those that yield the minimal lossless compression, and then spot the tuples with long encoded
lengths, i.e. those described using relatively infrequent patterns.

Dictionary based compression for anomaly detection was first studied in [Smets and Vreeken, 
2011], employing the Krimp itemset-based compressor introduced by Siebes et al. [Siebes et al., 
2006; Vreeken et al., 2011]. Besides high performance, it allows for characterization: one can 
easily inspect how tuples are encoded, and hence why one is deemed an anomaly. Where Krimp 
builds a single code table, we use multiple code tables to describe the data. As such, our method 
can better exploit correlations between groups of features, as by doing away with uncorrelated 
features it can locally assign codes more efficiently—leading to both better compression and 
higher accuracy. 
Furthermore, unlike Krimp, we build code tables directly. That is, instead of filtering very large
collections of pre-mined candidate patterns, we construct our code tables in a bottom-up fashion. 
We iteratively join those feature groups by which most compression can be gained, having only 
to consider patterns over the two groups. As such, CompreX is much more efficient, with 
running time growing only linearly with number of features and database size. 

Most importantly, by not requiring the user to provide a collection of candidate itemsets, nor a 
minimal support threshold, CompreX is parameter free in both theory and practice. We employ 
the Minimum Description Length (MDL) principle to automatically decide the number of feature 
groups, which features to group, what patterns to include in the corresponding code tables, as 
well as to point out anomalies. Most existing anomaly detection methods, on the other hand, have 
several parameters, such as the choice of a similarity function, density or distance thresholds, 
number of nearest neighbors, etc. 

In a nutshell, we improve over the state of the art by encoding data using multiple code tables, 
instead of one—allowing us to better grasp strongly interacting features. Moreover, we build our 
models directly from data—avoiding the expensive step of mining and filtering large collections 
of candidate patterns. 

Experiments show that the resulting models obtain high performance in anomaly detection, improving
greatly over the state of the art for categorical data. The method is more generally applicable,
though; even after discretization it matches the state of the art for numerical data, and correctly
identifies anomalies in both translated large graphs and image data.


11.1 CompreX: Method Description 


We first describe how dictionary-based compression works, where we introduce the preliminaries 
and notation. Next, we present our method and the proposed algorithms. 


11.1.1 Dictionary-based Compression 

Notation and Preliminaries 

In this work we consider categorical databases. A database $D$ is a bag of $n$ tuples over a set
of $m$ categorical features $\mathcal{F} = \{f_1, \ldots, f_m\}$. Each feature $f \in \mathcal{F}$ has a domain $dom(f)$ of
possible values $\{v_1, v_2, \ldots\}$. The number of values $v \in dom(f)$ is the arity of $f$, i.e.
$arity(f) = |dom(f)| \in \mathbb{N}$.

The domains are distinct between features. That is, $dom(f_i) \cap dom(f_j) = \emptyset$, $\forall i \neq j$. The domain
of a feature set $F \subseteq \mathcal{F}$ is the Cartesian product of the domains of the individual features $f \in F$,
i.e., $dom(F) = \prod_{f \in F} dom(f)$.

A database $D$ is simply a collection of $n$ tuples, where each tuple $t$ is a vector of length $m$
containing a value for each feature in $\mathcal{F}$. As such, $D$ can also be regarded as an $n$-by-$m$ matrix,
where the possible values in a column $i$ are determined by $dom(f_i)$.

An item is a feature-value pair $(f = v)$, with $f \in \mathcal{F}$ and $v \in dom(f)$. An itemset is then a pair
$(F = \mathbf{v})$, for a set of features $F \subseteq \mathcal{F}$, where $\mathbf{v} \in dom(F)$ is a vector of length $|F|$. We typically
refer to an itemset as a pattern.

A tuple $t$ is said to contain a pattern $(F = \mathbf{v})$, denoted as $p(F = \mathbf{v}) \subseteq t$ (or $p \subseteq t$ for short),
if for all features $f \in F$, $t_f = v_f$ holds. The support of a pattern $(F = \mathbf{v})$ is the number of
tuples in $D$ that contain it: $supp(F = \mathbf{v}) = |\{t \in D \mid (F = \mathbf{v}) \subseteq t\}|$. Its frequency is then
$freq(F = \mathbf{v}) = supp(F = \mathbf{v}) / |D|$.

Finally, the entropy of a feature set $F$ is defined as

$$H(F) = -\sum_{\mathbf{v} \in dom(F)} freq(F = \mathbf{v}) \log freq(F = \mathbf{v}).$$

All logarithms are to base 2, and by convention, $0 \log 0 = 0$.
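The definitions of support, frequency and entropy can be made concrete with a minimal Python sketch. It is only an illustration, not the original implementation: the function names (contains, supp, freq, entropy) and the representation of a pattern as a dictionary from feature index to value are assumptions made here, and the toy data are the six tuples of Table 11.1.

```python
from collections import Counter
from math import log2

# Toy database of Table 11.1: six tuples over features f1, f2, f3 (indices 0..2).
D = [("a", "b", "x")] * 4 + [("a", "c", "x"), ("a", "c", "y")]


def contains(t, pattern):
    """True if tuple t contains the pattern, given as {feature_index: value}."""
    return all(t[f] == v for f, v in pattern.items())


def supp(D, pattern):
    """Support: number of tuples in D that contain the pattern."""
    return sum(contains(t, pattern) for t in D)


def freq(D, pattern):
    """Frequency: support divided by the number of tuples."""
    return supp(D, pattern) / len(D)


def entropy(D, features):
    """H(F) in bits over the value combinations that occur in D.

    Combinations with zero frequency never show up in the counter,
    which matches the convention 0 log 0 = 0.
    """
    counts = Counter(tuple(t[f] for f in features) for t in D)
    n = len(D)
    return -sum((c / n) * log2(c / n) for c in counts.values())
```

For instance, supp(D, {0: "a", 1: "b"}) counts the four tuples that contain both f1 = a and f2 = b, and entropy(D, [0]) is zero because f1 takes a single value.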

In this work we take a pattern based approach to encode a given database, which we explain in 
detail next. 


Data encoding 

As models, we will use code tables. A code table is a simple two-column table. The first
column contains patterns, which are ordered descending (1) first by length and (2) second by
support. The second column contains the code words $code(p)$ corresponding to each pattern $p$.
An illustrative database with 6 tuples and an example code table for it are shown in
Table 11.1.


Data          Code Table
f1 f2 f3      p(F = v)    code(p)    usage(p)    L(code(p))
a  b  x       a b x       0          4           1 bit
a  b  x       a c         10         2           2 bits
a  b  x       x           110        1           3 bits
a  b  x       y           111        1           3 bits
a  c  x
a  c  y


Table 11.1: An illustrative database $D$ and an example code table $CT$ for a set of three features,
$F = \{f_1, f_2, f_3\}$.


The code words in the second column of a code table $CT$ are not important: their lengths are.
The length of a code word for a pattern depends on the database we want to compress. Intuitively,
the more often a pattern occurs in the database, the shorter its code should be. The usage of a
pattern $p \in CT$ is the number of tuples $t \in D$ which contain $p$ in their cover, i.e. have the code
of $p$ in their encoding.

The encoding of a tuple $t$, given a $CT$, works as follows: the patterns in the first column are
scanned in their predefined order to find the first pattern $p$ for which $p \subseteq t$. $p$ is then said to be
used in the cover of $t$, and the corresponding code word for $p$ in the second column becomes
part of the encoding of $t$. If $t \setminus p \neq \emptyset$, the encoding continues with $t \setminus p$ until $t$ is completely
covered, which yields a unique set of patterns that form the cover of $t$.
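The cover procedure can be sketched in a few lines of Python. This is an illustrative reading of the description above rather than the original code: tuples and patterns are represented as dictionaries from (hypothetical) feature names to values, and the code table is assumed to be already sorted by length and then by support.

```python
def cover(t, code_table):
    """Cover of tuple t by an ordered code table.

    t is a dict feature -> value; code_table is a list of non-empty
    patterns (dicts feature -> value) in their predefined order.
    Returns the list of patterns used, in the order they were picked.
    """
    remaining = dict(t)  # the part of t that is not covered yet
    used = []
    while remaining:
        for p in code_table:
            # p is contained in the remainder if all of its items match
            if all(remaining.get(f) == v for f, v in p.items()):
                used.append(p)
                for f in p:
                    del remaining[f]
                break
        else:
            raise ValueError("code table cannot cover the tuple")
    return used


# Code table of Table 11.1, in its predefined order.
CT = [
    {"f1": "a", "f2": "b", "f3": "x"},
    {"f1": "a", "f2": "c"},
    {"f3": "x"},
    {"f3": "y"},
]
D = [{"f1": "a", "f2": "b", "f3": "x"}] * 4 + [
    {"f1": "a", "f2": "c", "f3": "x"},
    {"f1": "a", "f2": "c", "f3": "y"},
]
covers = [cover(t, CT) for t in D]
```

Rescanning from the top after each match gives the same result as continuing the scan, since a pattern that does not fit the full tuple cannot fit a remainder of it. Counting how often each pattern appears in covers reproduces the usage column of Table 11.1.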

Given the usages of the patterns in a $CT$, we can compute the lengths of the code words for the
optimal prefix code [Grünwald, 2007]. The Shannon entropy gives the length for the optimal
prefix code for $p$ as

$$L(code(p) \mid CT) = -\log\left(\frac{usage(p)}{\sum_{p' \in CT} usage(p')}\right).$$

The number of bits to encode a tuple $t$ is simply the sum of the code lengths of the patterns in its
cover, that is,

$$L(t \mid CT) = \sum_{p \in cover(t)} L(code(p) \mid CT).$$

The total length of the encoded database is then the sum of encoded data tuple lengths,

$$L(D \mid CT) = \sum_{t \in D} L(t \mid CT).$$
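As a worked instance of the three formulas, the following Python sketch recomputes the code lengths of Table 11.1 from the usages and sums them into tuple and database costs. The pattern labels ("abx", "ac", ...) and the function names are illustrative assumptions; the covers are written out by hand for the six tuples of the table.

```python
from math import log2

# Usages from Table 11.1, keyed by an illustrative pattern label.
usage = {"abx": 4, "ac": 2, "x": 1, "y": 1}


def code_lengths(usage):
    """L(code(p) | CT) = -log2(usage(p) / total usage) for every pattern."""
    total = sum(usage.values())
    return {p: -log2(u / total) for p, u in usage.items()}


def tuple_cost(cover, lengths):
    """L(t | CT): sum of the code lengths of the patterns in the cover of t."""
    return sum(lengths[p] for p in cover)


def data_cost(covers, lengths):
    """L(D | CT): sum of the encoded lengths of all tuples."""
    return sum(tuple_cost(c, lengths) for c in covers)


# Covers of the six tuples: four times (a b x), then (a c x) and (a c y).
covers = [["abx"]] * 4 + [["ac", "x"], ["ac", "y"]]
lengths = code_lengths(usage)
total_bits = data_cost(covers, lengths)
```

The lengths come out as 1, 2, 3 and 3 bits, matching the last column of Table 11.1, so the four frequent tuples cost 1 bit each while the two rare ones cost 5 bits each.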


Model encoding 

To find the MDL-optimal compressor, we also need to determine the encoded size of the model,
the code table in our case. Clearly, the size of the second column in a given code table $CT$ that
contains the prefix code words $code(p)$ is trivial; it is simply the sum of their lengths. For the
size of the first column, we need to consider all the singleton items $\mathcal{I}$ contained in the patterns,
i.e., $\mathcal{I} = \bigcup_{f \in \mathcal{F}} dom(f)$.

For encoding the patterns in the left-hand column we again use an optimal prefix code. We first
compute the frequency of their appearance in the first column, and then by Shannon entropy
calculate the optimal length of these codes. Specifically, the encoding of the first column in
a code table requires $cH(P)$ bits, where $c$ is the total count of singleton items in the patterns
$p \in CT$, $H(\cdot)$ denotes the Shannon entropy function, and $P$ is a multinomial random variable
with the probability $p_i = r_i / c$, in which $r_i$ is the number of occurrences of singleton item $i \in \mathcal{I}$
in the first column. (Note that for the actual items, one could add an ASCII table providing the
matching from the prefix codes to the original names. Since all such tables are over $\mathcal{I}$, this only
adds an additive constant to the total cost.) All in all,

$$L(CT) = \sum_{p \in CT} L(code(p) \mid CT) + \sum_{i \in \mathcal{I}} -r_i \log(p_i).$$
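The model cost can be checked on the code table of Table 11.1 with a short Python sketch. The function name model_cost and the representation of patterns as tuples of items are assumptions for illustration; since domains are distinct between features, a value alone identifies an item here.

```python
from collections import Counter
from math import log2


def model_cost(patterns, usage):
    """L(CT) for a code table given as aligned lists of patterns and usages.

    Second column: sum of the code word lengths.
    First column: c * H(P) = sum over items of -r_i * log2(r_i / c).
    """
    total = sum(usage)
    second_column = sum(-log2(u / total) for u in usage)
    r = Counter(item for p in patterns for item in p)  # occurrences r_i
    c = sum(r.values())                                # total item count
    first_column = sum(-ri * log2(ri / c) for ri in r.values())
    return second_column + first_column


# Code table of Table 11.1.
patterns = [("a", "b", "x"), ("a", "c"), ("x",), ("y",)]
usage = [4, 2, 1, 1]
cost_bits = model_cost(patterns, usage)
```

In this example the second column contributes 1 + 2 + 3 + 3 = 9 bits; the first column holds c = 7 items, with a and x occurring twice and b, c, y once each.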
11.1.2 Compression with Sets of Code Tables: the Theory


In the previous section, we showed how to encode a database D using a single code table CT. 
In fact, this is the approach introduced in [Siebes et al., 2006] to compress transaction databases, 
employing frequent itemset mining to generate the candidate patterns for the code table. Here, 
we do not regard transaction data, but regular relational data, where tuples are data points in 
a multi-dimensional categorical feature space. In this space, some groups of features may be 
highly correlated; and hence may be compressed well together. 

As a result, we can improve by using multiple code tables, as we can then exploit correlations
and build a separate $CT$ for each highly correlated group of features. As such, using multiple
code tables, that is, a set of code tables, and mining these efficiently are two of the main
contributions of our work. Next, we formally introduce the concept of feature partitioning, and
give our problem statement.

Definition 7 A feature partitioning $\mathcal{P} = \{F_1, \ldots, F_k\}$ of a set of features $\mathcal{F}$ is a grouping of $\mathcal{F}$,
for which (1) each partition contains one or more features: $\forall F_i \in \mathcal{P}: F_i \neq \emptyset$, (2) all partitions
are pairwise disjoint: $\forall i \neq j: F_i \cap F_j = \emptyset$, and (3) every feature belongs to a partition:
$\bigcup_i F_i = \mathcal{F}$.

Formal Problem Statement 1 Given a set of $n$ data tuples in $D$ over a set of $m$ features in
$\mathcal{F}$, find a partitioning $\mathcal{P}: \{F_1, F_2, \ldots, F_k\}$ of $\mathcal{F}$ and a set of associated code tables
$\mathcal{CT}: \{CT_{F_1}, CT_{F_2}, \ldots, CT_{F_k}\}$, such that the total compression cost in bits given below is minimized:

$$L(\mathcal{P}, \mathcal{CT}, D) = L(\mathcal{P}) + \sum_{F \in \mathcal{P}} L(\pi_F(D) \mid CT_F) + \sum_{F \in \mathcal{P}} L(CT_F),$$

in which $\pi_{F_i}(D)$ denotes the projection of $D$ on feature subspace $F_i$.

Note that the number of features $m$ and the number of tuples $n$ are fixed over all models we
consider for a $D$, and hence contribute only an additive constant that we can safely ignore.


Partition encoding 

The first term of $L(\mathcal{P}, \mathcal{CT}, D)$ denotes the length of encoding the partitioning, which consists of
two parts: encoding (a) the number of partitions and (b) the features per partition.

(a) Encoding the number of partitions: First, we need to encode $k$, the number of partitions
and code tables. For this, we use the MDL-optimal encoding of an integer [Grünwald, 2007].
The cost for encoding an integer value $k$ is $L^0(k) = \log^*(k) + \log(c)$, with
$c = \sum_{j \geq 1} 2^{-\log^*(j)} \approx 2.865064$, and $\log^*(k) = \log(k) + \log\log(k) + \cdots$ sums over all positive terms.
Note that as $\log(c)$ is constant for all models, we ignore it.

(b) Encoding the features per partition: Then, for each feature we have to describe to which
partition it belongs. This we do using $m \log(k)$ bits.

In summary, $L(\mathcal{P}) = \log^*(k) + m \log(k)$.
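The partition cost is simple enough to write down directly. The sketch below, with the illustrative function names log_star and partition_cost, follows the summary formula and drops the constant $\log(c)$ as the text does.

```python
from math import log2


def log_star(k):
    """log*(k) = log2(k) + log2(log2(k)) + ..., summing only the positive terms."""
    total = 0.0
    x = log2(k)
    while x > 0:
        total += x
        x = log2(x)
    return total


def partition_cost(m, k):
    """L(P) = log*(k) + m * log2(k) bits for m features in k groups (k >= 1)."""
    return log_star(k) + m * log2(k)
```

With a single group (k = 1) the partitioning costs nothing, and the cost grows with both the number of groups and the number of features, e.g. partition_cost(3, 2) evaluates to 1 + 3 = 4 bits.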
Data encoding


The second term of $L(\mathcal{P}, \mathcal{CT}, D)$ denotes the cost of encoding the data with the given set of code
tables. To do so, each tuple is partitioned according to $\mathcal{P}$, and encoded using the optimal codes
in the corresponding code tables following the procedure described in §11.1.1.


Model encoding

The last term of $L(\mathcal{P}, \mathcal{CT}, D)$ denotes the model cost, that is the total length of encoding the
code tables. Each code table is encoded following the procedure described in §11.1.1.

Note that the number of feature groups $k$ is not a parameter of our method but rather is determined
by MDL. In particular, MDL ensures that we will not have two separate code tables for a pair of
highly correlated feature groups as it would yield lower data cost to encode them together. On
the other hand, combining feature groups may yield larger code tables, that is higher model cost,
which may not compensate for the savings from the data cost. In other words, we group features
for which the total encoding cost $L(\mathcal{P}, \mathcal{CT}, D)$ is reduced. Basically, we employ MDL to guide
us both in finding which features to group, and in choosing the number of groups to have.


11.1.3 Mining Sets of Code Tables: the Algorithm 

Having defined the cost function as $L(\mathcal{P}, \mathcal{CT}, D)$, we need an algorithm to search for the best set
of code tables $\mathcal{CT}$ for the optimal vertical partitioning $\mathcal{P}$ of the data such that the total encoded
size $L(\mathcal{P}, \mathcal{CT}, D)$ is minimized.

The search space for finding the best code table for a given set of features, let alone for finding
the optimal partitioning of features, however, is quite large. Finding the optimal code table for
a set of $|F_i|$ features involves finding all the possible patterns with different value combinations
up to length $|F_i|$ and choosing a subset of those patterns that would yield the minimum total cost
on the database induced on $F_i$. Furthermore, the number of possible partitionings of a set of $m$
features is the well-known Bell number $B_m$.
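To give a feel for how quickly the number of partitionings grows, the Bell numbers can be computed with the standard Bell triangle. The function name bell is an illustrative choice.

```python
def bell(m):
    """Bell number B_m (number of partitionings of m features), via the Bell triangle."""
    row = [1]
    for _ in range(m):
        nxt = [row[-1]]          # each row starts with the last entry of the previous one
        for x in row:
            nxt.append(nxt[-1] + x)
        row = nxt
    return row[0]
```

Already for m = 10 features there are 115975 possible partitionings, which is why an exhaustive search over feature groupings is not an option.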

While the search space is prohibitively large, it neither has a structure nor exhibits monotonicity 
properties which could help us in pruning. As a result, we resort to heuristics. Our approach 
builds the set of code tables in a greedy bottom-up, iterative fashion. We give the pseudo-code 
as Algorithm 2, and explain it in more detail in the following. 

CompreX starts with a partitioning $\mathcal{P}$ in which each feature belongs to its own group (1), and
separate, elementary code tables $CT_i$ for each feature $f_i$ associated with the feature sets (2).


Definition 8 An elementary code table $CT$ encodes a database $D$ induced on a single feature
$f \in \mathcal{F}$. The patterns $p \in CT$ consist of all length-1 unique items $v_1, \ldots, v_{arity(f)}$ in $dom(f)$.
Finally, $usage(p \in CT) = freq(f = v)$.
Typically, some features of the data will be more strongly correlated than others, e.g., the age of
a car and its fuel efficiency, or the weather temperature and flu outbreaks. In such cases, it will 
be worthwhile to represent features for which the correlation is ‘high enough’ together within 
one CT, as we can then exploit correlation to save bits. 

More formally, we know from Information Theory that given two (sets of) random variables (in
our case feature groups) $F_i$ and $F_j$, the average number of bits we can save when compressing
$F_i$ and $F_j$ together instead of separately is known as their Information Gain. That is,


$$IG(F_i, F_j) = H(F_i) + H(F_j) - H(F_i \cup F_j) \geq 0.$$


In fact, the IG of two (sets of) variables is always non-negative (zero when the variables are
independent from each other), which implies that the data cost would be the smallest if all the
features were represented by a single $CT$. On the other hand, our objective function also includes
the compression cost of the $CT$s.

Clearly, having one large CT over many (possibly uncorrelated) features will typically require 
more bits in model cost than it saves in data cost. Therefore, we can use IG to point out good 
merge candidates, subsequently employing MDL to decide whether the total cost is reduced, and 
hence, whether to approve the merge or not. 
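The Information Gain used to rank merge candidates can be sketched as follows. The helper names (entropy, information_gain, ig_per_feature) and the representation of feature groups as lists of column indices are assumptions for illustration; the toy data are again the six tuples of Table 11.1.

```python
from collections import Counter
from math import log2

D = [("a", "b", "x")] * 4 + [("a", "c", "x"), ("a", "c", "y")]


def entropy(D, F):
    """Empirical entropy H(F) in bits of the feature group F (list of column indices)."""
    counts = Counter(tuple(t[f] for f in F) for t in D)
    n = len(D)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def information_gain(D, Fi, Fj):
    """IG(Fi, Fj) = H(Fi) + H(Fj) - H(Fi u Fj) for two disjoint feature groups."""
    return entropy(D, Fi) + entropy(D, Fj) - entropy(D, Fi + Fj)


def ig_per_feature(D, Fi, Fj):
    """IG normalized by the total number of features in the two groups."""
    return information_gain(D, Fi, Fj) / (len(Fi) + len(Fj))
```

On this data, f1 is constant and therefore shares no information with f2, while f2 and f3 have a positive gain, so the pair (f2, f3) would be tried first. Whether the merge is actually accepted is still decided by the total MDL cost, not by IG alone.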

The first step then is to compute the IG matrix for all pairs of the current feature sets, which
is a non-negative and symmetric matrix (3). Let $|F_i|$ denote the cardinality, i.e. the number of
features in the feature set $F_i$. We sort the pairs of feature sets in decreasing order of IG-per-feature,
i.e. normalized by their total cardinality, and start outer iterations to go over these pairs
as the candidate $CT$s to be merged, say $CT_i$ and $CT_j$ (5). The starting cost $cost_{init}$ is set to the
total cost with the initial set of $CT$s (6).

The construction of the new $CT_{i|j}$ then works as follows: we put all the existing patterns
$p_{i,1}, \ldots, p_{i,n_i}$ and $p_{j,1}, \ldots, p_{j,n_j}$ from both $CT$s into the new $CT$ (7). Following the convention,
they are sorted first by length (from longer to shorter) and second by usage (from higher
to lower) (8). Candidate partitioning $\hat{\mathcal{P}}$ is built by dropping feature sets $F_i$ and $F_j$ from $\mathcal{P}$ and
including the concatenated set $F_{i|j}$ (9). Similarly, we build a temporary code table set $\hat{\mathcal{CT}}$ by
dropping the candidate tables $CT_i$ and $CT_j$ and adding the new $CT_{i|j}$ (10).

Next, we find all the unique rows of the database induced on $F_{i|j}$ (11). These patterns of length
$(|F_i| + |F_j|)$ are sorted in decreasing order of their occurrence in the database and constitute the
candidates to be inserted into the new $CT$. Let $p_{i|j,1}, \ldots, p_{i|j,n_{i|j}}$ denote these patterns of the
combined feature set $F_{i|j}$ in their sorted order of frequency. In our inner iterations (12), we insert
these one-by-one (13), update (i.e. decrease) the usages of the existing overlapping patterns
(14), remove those patterns whose usage drops to zero (15), recompute the code word lengths
with the updated usages (16) and compute the total cost after each insertion. If the total cost is
reduced, we store the candidate partitioning $\hat{\mathcal{P}}$ and the associated set of code tables $\hat{\mathcal{CT}}$ (18),
otherwise we continue insertions (from 12) with the next candidate patterns for possible future
cost reduction.

In the outer iterations, if the total cost is reduced, the IG between the new feature set $F_{i|j}$ and the rest
is computed (20). Otherwise the merge is rejected and the candidates $\hat{\mathcal{P}}$ and $\hat{\mathcal{CT}}$ are discarded
(22). Next the algorithm continues to search for future merges, and the search terminates when
there are no more pairs of feature sets that can be merged for reduced cost.

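Algorithm 2: CompreX Algorithm
Input: Database $D$ with $n$ tuples and $m$ (categorical) features
Output: A heuristic solution to Problem Statement 1: a feature partitioning
$\mathcal{P}: \{F_1, \ldots, F_k\}$, associated set of code tables $\mathcal{CT}: \{CT_1, \ldots, CT_k\}$, and total
encoded size $L(\mathcal{P}, \mathcal{CT}, D)$

 1  $\mathcal{P} \leftarrow \{F_1, \ldots, F_m\}$, $F_i = \{f_i\}$, $1 \leq i \leq m$
 2  $\mathcal{CT} \leftarrow \{CT_1, \ldots, CT_m\}$, where each $CT_i$ is elementary
 3  Compute IG between $\forall (F_i, F_j) \in \mathcal{P}$, $i > j$
 4  while no convergence, i.e. more merges do
 5      for each $(F_i, F_j) \in \mathcal{P}$ in decreasing normalized IG do
 6          Compute $cost_{init} \leftarrow L(\mathcal{P}, \mathcal{CT}, D)$
 7          Put patterns $p \in CT_i$ and $p \in CT_j$ into a new $CT_{i|j}$
 8          Sort $p \in CT_{i|j}$ (1) by length and (2) by usage
 9          $\hat{\mathcal{P}} \leftarrow \mathcal{P} \setminus (F_i \cup F_j) \cup F_{i|j}$
10          $\hat{\mathcal{CT}} \leftarrow \mathcal{CT} \setminus (CT_i \cup CT_j) \cup CT_{i|j}$
11          Find unique rows (candidate patterns) $p_{i|j}$ in $D_{F_{i|j}}$
12          for each unique row $p_{i|j,x}$ in decreasing frequency do
13              Insert $p_{i|j,x}$ to new $CT_{i|j}$
14              Decrease usages of overlapping patterns $p \in CT_{i|j}$ and $p \in cover(p_{i|j,x})$ by $freq(p_{i|j,x})$
15              Remove patterns $p \in CT_{i|j}$ with $usage(p) = 0$
16              Recompute $L(code(p \in CT_{i|j}))$ with new usages
17              if $L(\hat{\mathcal{P}}, \hat{\mathcal{CT}}, D) < L(\mathcal{P}, \mathcal{CT}, D)$ then
18                  $\mathcal{P} \leftarrow \hat{\mathcal{P}}$ and $\mathcal{CT} \leftarrow \hat{\mathcal{CT}}$
19          if $L(\mathcal{P}, \mathcal{CT}, D) < cost_{init}$ then
20              Compute IG between $F_{i|j}$ and $\forall F_x \in \mathcal{P}$, $F_x \neq F_{i|j}$
21          else
22              Discard $\hat{\mathcal{P}}$ and $\hat{\mathcal{CT}}$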

11.1.4 Computational Speedup and Complexity 

In Algorithm 2, the computationally most demanding steps are (1) finding all the unique rows in the
database under a particular feature subspace when two feature sets are to be merged (11) and (2)
after each insertion of a new pattern into the code table, finding the existing overlapping patterns
whose usages are to be decreased (14).

With a naive implementation, step (1) is performed on the fly, scanning the entire database once
and possibly using many linear scans and comparisons over the unique rows found so far in the
process. Furthermore, step (2) requires a linear scan over the current patterns in the new code
table $CT_{i|j}$ for each new insertion. The total computational complexity of these linear searches
depends on the database; however, with the outer and inner iteration levels (Lines 5 and 12,
respectively), they may become computationally infeasible for very large databases.

We improve with a simple design choice. Instead of an integer vector of usage per pattern in
$CT$, we have a sparse matrix $C$ for patterns versus data points, the binary entries $c_{ji}$ indicating
whether data tuple $i$ contains pattern $j$ in its cover. Note that the row sums of the $C$ matrix give
the usages of the patterns. In such a setting, step (1) (mining for candidate patterns) works as
follows: Let $F_i$ and $F_j$ denote the feature sets to be merged. Let $C_i$ denote the $n_i \times n$ patterns
versus data tuples matrix for code table $CT_i$, and similarly $C_j$ denote the $n_j \times n$ matrix for $CT_j$, in
which $n_i$ and $n_j$ respectively denote the number of patterns each table has. We obtain the usages
for the new candidate patterns (merged unique rows) under the merged feature subspace $F_{i|j}$ by
multiplying $C_i$ and $C_j^T$ into an $n_i \times n_j$ matrix $U$, which takes $O(n_i n n_j)$.

Next, we sort the merged patterns in decreasing order by their usage $U_{x,y}$ and insert them into
$CT_{i|j}$ one-by-one. Note that we exploit the existing patterns in the code tables to be merged,
rather than finding all the unique rows of the database. This way, we quickly identify good
frequent candidates and consider only $n_i n_j$ of them. Since $n_i n_j \ll n$, we practically reduce the
number of inner iterations (12) to a constant.

From here, step (2) (insertions) works as follows: Let $U_{x,y}$ denote the highest usage, associated
with the merged pattern $p_{i(x)} | p_{j(y)}$. The insertion of $p_{i(x)} | p_{j(y)}$ into the code table simply
means the addition of a new row to the $C_{i|j}$ matrix ($C_{i|j}$ is obtained by concatenating the rows
of $C_i$ and $C_j$ and reordering its rows (1) by length and (2) by usage of the patterns it initially
contains). The new row is then the element-wise product (i.e. logical AND) of row $x$ in $C_i$ and row
$y$ in $C_j$ (data tuples which contain both $p_{i(x)}$ and $p_{j(y)}$ in their cover). Moreover, we decrease
the usages of the merged patterns $p_{i(x)}$ and $p_{j(y)}$ by subtracting the new row from both of their
corresponding rows. All these updates are $O(n)$ operations.
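The pattern-versus-tuple bookkeeping can be mimicked in plain Python by storing each row of $C$ as the set of tuple indices whose cover uses that pattern, which plays the role of a sparse binary row. This is only a sketch of the idea; the function names and the set-based representation are assumptions, not the original sparse-matrix code.

```python
def candidate_usages(C_i, C_j):
    """U[x][y]: number of tuples whose cover uses pattern x of CT_i and pattern y of CT_j.

    Equivalent to the matrix product of C_i with the transpose of C_j
    when the rows are binary indicator vectors.
    """
    return [[len(rx & ry) for ry in C_j] for rx in C_i]


def insert_merged(C_i, C_j, x, y):
    """Insert the merged pattern (x, y) and update the old rows in place.

    The new row is the logical AND of the two rows; subtracting it from
    both rows decreases the usages of the two merged patterns.
    """
    new_row = C_i[x] & C_j[y]
    C_i[x] -= new_row
    C_j[y] -= new_row
    return new_row


# Elementary code tables for f2 and f3 of Table 11.1 (tuple indices 0..5).
C_f2 = [{0, 1, 2, 3}, {4, 5}]      # rows: f2 = b, f2 = c
C_f3 = [{0, 1, 2, 3, 4}, {5}]      # rows: f3 = x, f3 = y
U = candidate_usages(C_f2, C_f3)
row_bx = insert_merged(C_f2, C_f3, 0, 0)   # candidate with the highest usage
```

The usage of any pattern is simply the size of its row. After inserting the merged pattern (b, x), the row of b becomes empty, so b would be removed from the merged table, while x keeps a usage of 1.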

All in all, the inner loop (starting in 12) goes over $n_i n_j$ ($\approx$ constant) candidate patterns
and each insertion takes $O(n)$. Therefore, the inner loop takes $O(n)$.

Next, we consider the outer loop (starting in 5), which tries to merge pairs of code tables. In
the worst case we get $O(m^2)$ trials when all merges are discarded. As a result the worst case
complexity of CompreX becomes $O(m^2 n)$. In practice, however, many features are correlated
and we obtain a merge at almost every step, yielding approximately linear complexity in both
data size and dimension.
11.1.5 Anomaly Detection

Compression based techniques are naturally suited for anomaly and rare instance detection. Next 
we describe how we exploit our dictionary based compression framework for this task. 

In a given code table, the patterns with short code words, that is those that have high usage, 
represent the patterns in the data that can effectively compress the majority of the data points. In 
other words, they capture the trends summarizing the norm in the data. On the other hand, the 
patterns with longer code words are rarely used and thus encode the sparse regions in the data. 
Consequently, the data tuples can be scored by their encoding cost for anomalousness. 

More formally, given a set of code tables $CT_1, \ldots, CT_k$ found by CompreX, each data tuple $t \in D$
can be encoded by one or more code words from each $CT_i$, $i \in \{1, \ldots, k\}$. The corresponding
patterns constitute the cover of $t$ as discussed in §11.1.1. The encoding cost of $t$ is then considered
as its anomalousness score; the higher the compression cost, the more likely it is “to arouse
suspicion that it was generated by a different mechanism” [Hawkins, 1980].

$$score(t) = L(t \mid \mathcal{CT}) = \sum_{F \in \mathcal{P}} L(\pi_F(t) \mid CT_F) = \sum_{F \in \mathcal{P}} \sum_{p \in cover(\pi_F(t))} L(code(p) \mid CT_F)$$

Having computed the compression costs, one can sort them and report the top k data points with 
the highest scores as possible anomalies. An alternative way [Smets and Vreeken, 2011] is to 
determine a decision threshold θ and flag those points with a compression cost greater than θ as 
anomalies. One can use Cantelli's inequality [Grimmett and Stirzaker, 2001], which provides 
a well-founded way to determine a good value for the threshold θ for a given confidence level, 
that is, an upper bound for the false positive rate. For generality, in our experiments we show the 
accuracy at all decision thresholds. 
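
As a sketch of the threshold-based alternative: Cantelli's inequality states that 
P(X - μ >= kσ) <= 1 / (1 + k^2) for any k > 0. Bounding the false positive rate by α therefore 
gives k = sqrt(1/α - 1) and θ = μ + kσ. The code below assumes that μ and σ are estimated from 
the observed compression costs; the function name and the toy costs are illustrative only. 

```python
import math
import statistics

def cantelli_threshold(costs, alpha):
    """Threshold theta such that, by Cantelli's inequality, at most an
    alpha fraction of costs drawn from the same distribution exceed it.
    Mean and standard deviation are estimated from the data (an assumption)."""
    mu = statistics.fmean(costs)
    sigma = statistics.pstdev(costs)
    k = math.sqrt(1.0 / alpha - 1.0)
    return mu + k * sigma

# Hypothetical compression costs: 98 ordinary tuples and 2 expensive ones.
costs = [4.0] * 98 + [100.0, 105.0]
theta = cantelli_threshold(costs, alpha=0.1)  # alpha = 0.1 gives k = 3
flagged = [c for c in costs if c > theta]
print(theta, flagged)
```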


11.2 Experimental Results 


In our study, we experimented with many datasets from diverse domains, as shown in Tables 11.2 
and 11.3. Other datasets we used include graph and satellite image datasets, which we will 
discuss later in the experiments. 

We evaluated our method with respect to four criteria: (1) compression cost in bits, (2) running 
time, (3) detection accuracy, and (4) scalability. We also compared our results with two state of 
the art methods, Krimp [Smets and Vreeken, 2011] and LOF [Breunig et al., 2000a]. Krimp 
is also a compression based anomaly detection method, but uses a single code table to encode 
a dataset. It performs frequent itemset mining as a pre-processing step to generate candidate 
patterns for the code table. LOF is a density-based outlier detection method that computes the 
local-outlier-factor of data points with respect to their nearest neighbors. Neither method is 
parameter-free: Krimp requires a minimum support threshold, and LOF requires the number of 
neighbors to be considered. 

11.2.1 Compression-cost and Running-time 


One goal of our study is to develop a fast method that models the norm of the data and 
hence gives a low compression cost in bits. In this section, we use both CompreX and Krimp for 
compressing our datasets and show the total cost in bits and the corresponding running times in 
Figure 11.1 (a) and (b). Note that the results for our largest datasets Enron, Connect-4, and 
Covertype are given with broken y-axes with values in millions for cost and thousands for time. 

We note that CompreX achieves high compression rates, outperforming Krimp on all of 
the datasets and providing up to 96% savings in bits (47% on average). 

With respect to running time, we notice that for most of the (smaller) datasets, our running time 
is only slightly higher than that of Krimp but still remains under 16 seconds. Even though the 
run time of Krimp is negligible for small datasets, it increases significantly for datasets with a 
large number of features. The computationally most demanding part of 
Krimp is the frequent itemset mining, making it less feasible for large and high dimensional 
categorical datasets. For example, for the Connect-4 dataset with 42 features and the Chess 
(king, rook vs. king, pawn) dataset with 36 features, Krimp cannot finish within reasonable time 
due to very large candidate sets. 

To alleviate this problem, Krimp accepts a minsup parameter, which is the minimum number of 
occurrences of an itemset in the database for it to be considered frequent. The higher the minsup 
is set, the fewer itemsets are extracted. However, a high minsup comes with a trade-off: the 
higher the minsup, the smaller the number of candidates, the smaller the search space, and the 
worse the final code table approximates the optimal code table. In contrast, our method requires 
neither parameters nor frequent itemset mining. 
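
The effect of minsup on the candidate set can be seen in a small sketch. The brute-force 
enumeration below is only illustrative (it is not the miner used by Krimp, and the 
transactions are made up), but it shows how raising minsup shrinks the set of candidate itemsets. 

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, minsup):
    """Enumerate every itemset that occurs in at least minsup transactions.
    Brute force: exponential in transaction length, for illustration only."""
    counts = Counter()
    for transaction in transactions:
        items = sorted(transaction)
        for size in range(1, len(items) + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    return {s: c for s, c in counts.items() if c >= minsup}

transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
    {"a", "b", "c"},
]
n_low = len(frequent_itemsets(transactions, minsup=1))   # all 7 itemsets
n_high = len(frequent_itemsets(transactions, minsup=4))  # only the 3 singletons
print(n_low, n_high)
```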

In our experiments, we find closed frequent itemsets with minsup set to 5000 and 500 for the 
Connect-4 and Chess (kr-kp) datasets, respectively. Even then, the running time of our method 
remains lower than that of Krimp (see Fig. 11.1). On the other hand, the time required for 
frequent itemset mining also depends on the dataset characteristics. For example, we observe in 
Figure 11.1 that the running time of Krimp on (larger) Mushroom is much smaller than that on 
the (smaller) Spect-heart dataset, with both having the same number of (22) features. 

For our largest datasets (in terms of size and number of features) in Figure 11.1, notice that the 
running time of Krimp is quite large: about 20 minutes for Enron and Connect-4, and 45 minutes 
for Covertype. Moreover, its compression cost is still higher than that of our method. We 
conclude that our method becomes more advantageous especially for large datasets. 


11.2.2 Detection Accuracy 

Besides achieving a high compression rate, we also want our method, perhaps even more 
importantly, to be effective in spotting anomalies. In this section, we experiment with various 
types of data including transaction, graph and image databases. 

Figure 11.1: (top) Compression cost (in bits) when encoded by CompreX vs. Krimp. (bottom) 
Running time (in seconds) of CompreX vs. Krimp. For large datasets, extremely 
many frequent itemsets negatively affect the runtime for Krimp. 


CompreX on categorical data 


For measuring detection performance, we use two-class datasets in which the number of data 
points from one class is significantly smaller than that from the other class. We call these the 
minority and the majority classes, respectively. The data points from the minority class are 
considered to be the “anomalies”. Examples of the classes include poisonous vs. edible in the 
Mushroom data, unacceptable vs. very good in the Car data, and win vs. loss in the Connect-4 data. 

As a measure of accuracy, we use average precision: the average of the precision values obtained 
across recall levels. We plot the detection precision, that is, the ratio of the number of true 
positives to the total number of predicted positives, against the recall (= detection rate), that is, 
the fraction of total true anomalies that are detected. A point on the plot is obtained by setting 
a threshold compression cost: any record with a cost larger than that threshold is flagged as an 
anomaly. The corresponding precision and recall are then calculated. By varying the threshold, 
we obtain the curve for the entire range of recalls. The average precision then approximates the 
area under the precision-recall curve. 
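
The evaluation procedure just described can be sketched as follows. The code assumes one common 
definition of average precision (the mean of the precision values at the ranks where a true 
anomaly is retrieved) and ignores ties between scores; the exact averaging used in our 
experiments may differ in such details. Scores and labels are made up for illustration. 

```python
def precision_recall_points(scores, labels):
    """(recall, precision) after each record, scanning by decreasing score.
    labels: 1 for a true anomaly (minority class), 0 otherwise."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total_pos = sum(labels)
    points, tp = [], 0
    for rank, i in enumerate(order, start=1):
        tp += labels[i]
        points.append((tp / total_pos, tp / rank))
    return points

def average_precision(scores, labels):
    """Mean precision over the ranks at which a true anomaly appears."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            tp += 1
            precisions.append(tp / rank)
    return sum(precisions) / len(precisions)

# Hypothetical compression costs and ground-truth labels.
scores = [9.1, 7.5, 3.2, 3.0, 1.1]
labels = [1, 0, 1, 0, 0]
points = precision_recall_points(scores, labels)
ap = average_precision(scores, labels)
print(points[0], ap)
```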

Figure 11.2 shows the precision-recall plots of CompreX and Krimp on several of our categorical 
datasets. Here, a higher curve denotes better performance, since it corresponds to a higher 
precision for a given recall. The average precision values are also given in Table 11.2, for all 
the categorical datasets. Notice that for most datasets CompreX achieves higher accuracy than 
Krimp. This is especially evident for the Car, Chess, Led and Nursery datasets. We notice that 
the performance of the methods also depends on the detection task. For example, both methods 
perform well on the Mushroom dataset, for which the poisonous ones exhibit quite different 
features than the edible ones. However, the accuracies of both methods drop for the Connect-4 
dataset, for which the detection task, i.e. which player is going to win the game given the 8-ply 
positions, is much harder. 


CompreX on numerical data 

While CompreX is designed to work with categorical datasets, it can also be used to detect 
anomalies in datasets with numerical features. For that, we first convert the continuous numerical 
features to discretized nominal features. Various techniques exist to this end. In our study, 
we consider several: linear, logarithmic, SAX [Lin et al., 2003], and MDL-based [Kontkanen 
and Myllymäki, 2007] binning. Linear binning divides the value range of each feature 
into equal sized intervals. Logarithmic binning first sorts the feature values and assigns the lower 
b-fraction to the first bin, the next b-fraction of the rest to the second bin, and so on, until all 
the values are assigned to a bin. SAX has proved to be an effective symbolic representation, 
especially for time series data. MDL-based binning automatically estimates variable-width 
histograms with an optimal bin count for a given data precision. 
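
A minimal sketch of the two simplest schemes, linear and logarithmic binning, is given below. 
The function names are hypothetical, and the rounding rule in the logarithmic scheme (taking 
the ceiling of the b-fraction, at least one value per bin) is an assumption made so that the 
procedure terminates; tied values may also land in different bins in this simplified version. 

```python
import math

def linear_bins(values, n_bins):
    """Assign each value to one of n_bins equal-width intervals over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant feature
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def logarithmic_bins(values, b):
    """Sort the values; the lowest b-fraction goes to bin 0, the lowest
    b-fraction of the remainder to bin 1, and so on until all are assigned."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    start, bin_id = 0, 0
    while start < len(order):
        take = max(1, math.ceil(b * (len(order) - start)))
        for i in order[start:start + take]:
            bins[i] = bin_id
        start += take
        bin_id += 1
    return bins

lin = linear_bins([0.0, 2.5, 5.0, 7.5, 10.0], n_bins=4)
log = logarithmic_bins([1, 2, 3, 4, 5, 6, 7, 8], b=0.5)
print(lin, log)
```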

We experiment with these various discretization methods under various parameter settings. In 
Figure 11.3, we show the accuracy of CompreX versus Krimp on the Shuttle dataset, using 
logarithmic binning with b=0.5, linear binning with 5 and 10 bins, SAX with 3 bins, and 
MDL-based binning with precision 0.01. Results are similar for many other settings and for the rest 
of the numerical datasets, which we omit for brevity. Notice that regardless of the discretization 
used, CompreX performs consistently better than its competitor Krimp. 

Figure 11.2: Performance of CompreX vs Krimp on several two-class categorical transaction 
datasets. Plotted are precision vs. recall for various thresholds to flag data points as 
anomalies. Notice that CompreX outperforms Krimp for most detection tasks. 

Dataset            |D|     |F|   |P|   minsup   CompreX   Krimp 
Car evaluation     1275    6     6     1        0.99      0.06 
Chess (k)          4580    6     4     1        0.67      0.01 
Dermatology        132     12    12    1        0.45      0.37 
Solar flare 0-1    1312    10    6     1        0.19      0.15 
Solar flare 0-2    1211    10    6     1        0.11      0.10 
Led display        370     7     7     1        0.99      0.31 
Nursery            4648    8     6     1        1.00      0.52 
Chess (kp) W-L     1689    36    18    500*     0.14      0.06 
Chess (kp) L-W     1547    36    18    500*     0.03      0.03 
Letter rec. A-B    809     16    11    1*       0.17      0.18 
Letter rec. P-R    823     16    11    1*       0.57      0.34 
Mushroom           4258    22    5     1*       1.00      0.93 
Digit rec. 0-1     1163    16    12    10       0.75      0.63 
Digit rec. 1-7     1163    16    9     20       0.51      0.34 
Spect-heart        267     22    16    1*       0.12      0.12 
Connect-4          44k     42    11    5k*      0.02      0.01 
Covertype          286k    44    6     280k*    0.46      0.27 
MAP                                             0.48      0.26 

Table 11.2: Average precision (normalized area under the precision-recall curve) for the 
categorical datasets, comparing CompreX and Krimp. Further, we give dataset size |D|, 
number of features |F| and partitions |P|, and minsup used for Krimp (star denotes closed 
itemsets). 

Next, we compare our method with LOF on some numerical datasets. LOF is a widely 
used outlier detection method based on local density estimation. While it is quite powerful when 
applied to numeric data, it cannot be directly applied to categorical datasets. In this comparison, 
both methods require the careful choice of a parameter: the number of nearest neighbors k for 
LOF, and the binning method and its corresponding parameter for CompreX. 

In Figure 11.4, we show the accuracy of LOF versus CompreX with their best parameter choices 
on our numerical two-class datasets. Table 11.3 gives the corresponding average precision scores 
for all three methods. Notice that CompreX achieves comparable or better performance than 
LOF, even though it operates on discretized data, which loses some information in the process, 
and thus is not as optimized as LOF for numeric data. This shows that CompreX can 
also be applied to datasets with numerical features or with a hybrid of both categorical and 
numerical features. 

Figure 11.3: Performance of CompreX vs Krimp on the two-class numerical transaction 
dataset Shuttle. Notice that CompreX outperforms Krimp consistently for the various 
discretization methods used. 

Figure 11.4: Performance of LOF vs CompreX with the best choice of parameters on the 
numerical transaction datasets. Notice that CompreX achieves comparable or better 
performance than LOF even after discretization. 


Dataset       |D|     |F|   |P|   minsup   CompreX   Krimp   LOF 
Shuttle       3416    9     5     1        1.00      0.22    0.83 
Pageblocks    4941    10    6     1        0.46      0.37    0.23 
Yeast         468     8     8     1        0.49      0.17    0.48 
Abalone       703     7     3     1        0.42      0.15    0.24 
MAP                                        0.59      0.23    0.45 

Table 11.3: Average precision for the numerical datasets, comparing CompreX, Krimp and 
LOF. Further, we give dataset size |D|, number of features |F| and partitions |P|, and minsup 
used for Krimp. 


CompreX on graph data 

Given data points and their features in numerical or categorical space, our method can also be 
applied to other complex data, including graphs. To this end, we study the Enron graph, in which 
nodes represent individuals and the edges represent email interactions. In our setting the nodes 
correspond to the data points, and the features to the ego-net features we extract from each node. 
The ego-net of a node (ego) is defined as the subgraph of the node, its neighbors, and all the 
links between them. We extract 14 numerical ego-net features, such as the number of edges, total 
weight, ego-net degree (number of edges connecting the ego-net to the rest of the graph), in- and 
out-degree, etc. Features are discretized into 10 linear bins. 
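
To make the ego-net features concrete, the following sketch computes three of them (number of 
edges, total weight, and ego-net degree) on a small undirected weighted graph. The adjacency 
representation, the function name and the toy graph are assumptions made for illustration; the 
directed features (in- and out-degree) are omitted. 

```python
def egonet_features(adj, ego):
    """Three ego-net features of `ego` in an undirected weighted graph.

    adj: dict node -> dict neighbor -> weight, stored symmetrically.
    """
    members = {ego} | set(adj[ego])  # the ego and its neighbors
    n_edges, total_weight, boundary = 0, 0.0, 0
    for u in members:
        for v, w in adj[u].items():
            if v in members:
                if u < v:  # count each internal edge once
                    n_edges += 1
                    total_weight += w
            else:
                boundary += 1  # edge leaving the ego-net
    return {"edges": n_edges, "weight": total_weight, "egonet_degree": boundary}

# Toy graph: a triangle a-b-c with a tail c-d-e.
adj = {
    "a": {"b": 2, "c": 1},
    "b": {"a": 2, "c": 3},
    "c": {"a": 1, "b": 3, "d": 1},
    "d": {"c": 1, "e": 5},
    "e": {"d": 5},
}
features_a = egonet_features(adj, "a")
print(features_a)
```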

In Table 11.4, we show the top 5 email addresses with the highest compression cost found by 
CompreX. The dataset does not contain any ground truth anomalies, therefore we provide 
anecdotal evidence for the discovered “anomalies”. Our first observation is the significantly high 
compression cost of the listed points: 103 to 107 bits, given a global average of 6.72 bits (median 
is 4.27). This is due to the high number of rare patterns used in their cover. Notice 
that each of them is covered with 7 patterns compared to a global average of 2.05 (median is 
2). Moreover, the usages of the cover patterns are quite small, which yields longer code words 
and a high total compression cost. Further inspection justifies our results: for example, 
‘sally.beck’ (employee, chief operating officer) has the highest number of edges (31k) in her 
ego-net and the highest ego-net degree (85k), implying that she is highly connected to the rest 
of the graph as opposed to many other nodes in the graph. 


name@enron.com    cost (bits)   |cover|   avg usage ± std of cover patterns 
sally.beck        107.28        7         3.7 ± 5.4 
jeff.dasovich     107.11        7         3.4 ± 3.9 
outlook.team      106.70        7         4.8 ± 6.3 
david.forster     105.11        7         4.1 ± 5.2 
kenneth.lay       103.24        7         5.8 ± 7.5 
robert.badeer     1.53          2         52k ± 4.3k 

Table 11.4: Top-5 anomalies for Enron, with one regular-joe example. Given are email address, 
compression cost, size of cover, and average usages of the cover patterns; 
high usages correspond to short codes. Average cost is 6.72 bits, average number of 
covering patterns is 2.05. 


CompreX on image data 


Next, for our image datasets, for which class labels also do not exist, we provide an anecdotal 
and visual study. The image datasets are satellite images of four major cities from around 
the world, as shown in Figure 11.5. Each image is split into 25x25 rectangular tiles, for each of 
which we extract 15 numerical features that are subsequently discretized into 10 linear bins. The 
first three features denote the mean RGB values of each tile and the rest denote Gabor features. 



Figure 11.5: “Anomalous” tiles (those with high compression cost) on the image datasets are 
highlighted with red borders (figures best viewed in color): (a) Holy See, Vatican; 
(b) Washington D.C., USA; (c) Forbidden City, Beijing; (d) London, UK. Notice that CompreX 
successfully spots qualitatively distinct regions that stand out in the images. 


In Figure 11.5, top tiles with high compression cost are highlighted in red. We observe that 
CompreX effectively spots interesting and rare regions. For example, the districts of the Roman 
Catholic Church in the Vatican and the Washington Monument in Washington D.C., which 
distinctively stand out in the images, are captured among the top anomalies. In the Forbidden 
City image, CompreX spots the three lakes (Beihai, Zhonghai, Nanhai), the Jingshan Park on its 
right, as well as the Tiananmen Square to the south. Finally, in London CompreX marks part of 
the river Thames, Buckingham Palace, as well as several rare plain fields in the city. 


11.2.3 Scalability 

Krimp falling short on large datasets raises the question of scalability, which is perhaps even 
more important than the issue of speed. Therefore, in Figure 11.6, we also show the running 
time of both methods for growing dataset and feature sizes on Enron and Connect-4. 

We observe that the running time of Krimp grows significantly with the increasing size in both 
cases. The difference is evident especially for growing feature size. This is due to frequent 
itemset mining scaling exponentially with respect to number of features. On the other hand, the 
increase in running time for our method follows a linear trend indicating that CompreX scales 
better with increasing database size and dimension. 

Figure 11.6: Scalability of CompreX and Krimp with increasing dataset and feature size: (top) 
Enron with large n, (bottom) Connect-4 with large m. Notice that CompreX 
achieves better scalability for large databases, following a linear trend in growth. 


We note that the running time of CompreX could be further improved by operating on a smaller, 
random sample of the data. In a large enough random sample, the patterns are expected to occur 
as frequently as they occur in the original dataset and therefore the code tables built after random 
sampling would be close to those built from the entire data. The theoretical and analytical study 
of random sampling for code-table compression, in particular the time versus accuracy trade-offs 
that the sampling poses, constitutes a promising research direction. 

An advantage of our method is that it can work as an any-time algorithm. As it is a bottom-up 
approach that seeks a lower total cost over iterations via more merges, the compression 
procedure can be stopped at any point depending on the time available. The same goal can be 
achieved for Krimp by setting the minsup parameter accordingly; the higher the minsup, the 
lower the running time. 

To compare the methods, we show the compression costs achieved at various durations in 
Figure 11.7. Notice that for any given run-time duration, CompreX achieves lower compression 
cost than its competitor. Moreover, even if Krimp is allowed to run for longer, it still cannot reach 
as low a cost as CompreX can in much less time. 

Figure 11.7: Scatter plots of compression cost versus running time of CompreX and Krimp. 
Notice that CompreX achieves lower cost at any given time, which Krimp cannot 
catch up with even when left to run for longer. 


11.3 Summary of contributions 


In this chapter, we introduced a novel, parameter-free method called CompreX for the important 
task of anomaly detection in complex multi-dimensional databases. CompreX builds a data 
compression model that uses multiple dictionaries for encoding, and reports the data points with 
high compression cost as anomalous. Our method proves effective for both tasks: it provides 
higher compression rates at lower running times than its state-of-the-art competitor Krimp, 
especially for large datasets. In addition, it is capable of spotting rare instances effectively, with 
detection accuracy that is often higher than that of its closest competitor Krimp on categorical 
datasets, and of LOF, its state-of-the-art competitor on numerical datasets. 

Experiments on diverse datasets show that CompreX successfully generalizes to a broad range 
of datasets including image, graph, and traditional relational databases with both categorical and 
numerical features. Moreover, it is scalable, with running time growing linearly with increasing 
database size and dimension. 





Chapter 12 

Events in Time-evolving Graphs 


Problem Statement: At what points in time do many of the nodes in a given time-varying 
graph change their behavior significantly? Can we attribute the change to specific nodes, that 
is, can we characterize which nodes change in behavior the most? 


In this chapter, we develop an algorithm to spot change-points in a time-varying graph at which 
many nodes deviate from their normal ‘behavior’. In a nutshell, our method works as follows. 
We first extract time sequences of several network (egonet) features for all nodes in the graph. 
Then, we derive a ‘behavior’ vector of all nodes and compare it to the recent past ‘behavior’ 
vectors detected over several previous time windows. If the current behavior is found to be 
significantly different from the recent past, we flag the current time window as anomalous and 
report that an event has occurred. Moreover, in order to attribute the change to the subset of 
nodes that contribute the most to the change score, we compare their ‘behavior’ values and mark 
those nodes for which this value differs much more than for others. 

To demonstrate the effectiveness of our method, we study the texting behavior of users in our 
SMS network. This data consists of six months of activity and is therefore time-varying. Also, 
the edges are weighted, with weights denoting the total number of SMSs sent/received between 
individual pairs. More specifically, the SMS network constructed from the SMS records includes 
over 2 million users with 50 million SMS interactions between them over a period from Dec. 1, 
2007 to May 31, 2008. 


12.1 Event Detection: Method Description 

12.1.1 Feature Extraction 

In order to find patterns that nodes of a graph follow, we characterize the nodes with several 
features so that each node becomes a multi-dimensional point. In particular, each node is 
summarized by a set of features extracted from its egonet (the egonet of a node includes the node 
itself, its neighbors, and all the interactions between these nodes). The 12 features considered in 
this work are as follows: 1) in-degree, 2) out-degree, 3) in-weight, 4) out-weight, 5) number of 
neighbors, 6) number of reciprocal neighbors, 7) number of triangles, 8) average in-weight, 
9) average out-weight, 10) maximum in-weight, 11) maximum out-weight, and finally 12) maximum 
weight ratio on reciprocated edges in the egonet. 


12.1.2 Change-Point Detection 


The flow of our method to detect change-points in the behavior of nodes is illustrated in 
Figure 12.1. This method is similar to [Ide and Kashima, 2004], but differs in the construction of 
the ‘dependency’ matrices C (top right in the figure) as we describe in the following. 

Figure 12.1: The Work-flow of Change-Point Detection 


Here, the data we study looks like the 3-D tensor on the top left of Figure 12.1, where T denotes 
the number of time ticks (T = 183 days), N denotes the number of nodes in our graph (N = 2 
million customers), and F denotes the number of features extracted for each node (F = 12 as 
described in the previous section). To start with, we take one ‘slice’ of this 3-D T x N x F tensor 
for a particular feature F_i, say in-weight, which is a T x N matrix (top middle in the figure). 
Next, we define a window of size W over the time-series of values of all nodes for that particular 
feature F_i. Then, for each pair of nodes, we compute the correlation between their time-series 
vectors over the window of size W using Pearson's ρ as follows. 

$$\rho(X, Y) = \frac{\sum_{i=1}^{W} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{W} (X_i - \bar{X})^2} \, \sqrt{\sum_{i=1}^{W} (Y_i - \bar{Y})^2}}$$

In the above equation, X and Y are the length-W vectors for each node pair (x, y). So, for each 
window we construct a correlation matrix C, where C_{x,y} = ρ(X, Y) over window W. Next, we 
slide the window down one time tick (day) and compute the correlations over the next window of 
W time ticks. We keep repeating this process until we reach the end of our data. To be 
representative and given the periodic behavior of human nature, we chose the size W as 7 days 
(one week). As a result, we end up constructing 177 C matrices (top right in the figure). 
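
The sliding-window construction can be sketched as follows for a small number of nodes (for 
N in the millions the N x N matrices would of course not be materialized this way). The code 
assumes that time ticks are stored as rows and nodes as columns, and it adopts the convention, 
reported with our experimental results, that two nodes with constant (for example all-zero) 
series in a window get correlation 1. The function name and the random data are illustrative. 

```python
import numpy as np

def window_correlations(series, window):
    """Pearson correlation matrices over all sliding windows.

    series: T x N array for one feature (rows = time ticks, columns = nodes).
    Returns a list of T - window + 1 matrices of shape N x N.
    """
    T, _ = series.shape
    matrices = []
    for start in range(T - window + 1):
        block = series[start:start + window]
        centered = block - block.mean(axis=0)
        norms = np.linalg.norm(centered, axis=0)
        inactive = norms == 0  # constant series: correlation undefined
        Z = centered / np.where(inactive, 1.0, norms)
        C = Z.T @ Z
        C[np.ix_(inactive, inactive)] = 1.0  # convention: constant pairs get 1
        matrices.append(C)
    return matrices

rng = np.random.default_rng(0)
series = rng.random((183, 4))       # T = 183 days, 4 toy nodes
mats = window_correlations(series, window=7)
print(len(mats))                    # 183 - 7 + 1 windows
```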

By the Perron-Frobenius theorem [Pillai et al., 1912], the largest (principal) eigenvector of each 
of the C matrices is positive. The value for each node in the eigenvector can be thought of as the 
‘activity’ of that node; that is, the more correlated a node is to the majority of the nodes, the 
higher its ‘activity’ value will be. Here, we call each such size-N eigenvector the 
‘eigen-behavior’ of all the nodes in the graph as a whole. 

Note that in principle, we do not have to explicitly compute the C matrices in order to obtain their 
eigenvector(s). Since C = W^T W after standardizing W's rows, the right singular vectors of W 
are the same as the eigenvectors of C. Therefore, we obtain the eigen-behaviors by operating 
directly on the W matrices. 
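
A small numerical sketch of this shortcut follows. It assumes the window matrix is stored with 
time ticks as rows and nodes as columns, so that each node's series (a column in this layout) is 
centered and scaled to unit norm before the SVD; the function name is hypothetical and 
non-constant series are assumed. 

```python
import numpy as np

def eigen_behavior(block):
    """Principal eigenvector of the correlation matrix of `block`,
    obtained from an SVD of the standardized window without forming C.

    block: window x N array (rows = time ticks, columns = nodes).
    """
    centered = block - block.mean(axis=0)
    norms = np.linalg.norm(centered, axis=0)
    Z = centered / np.where(norms > 0, norms, 1.0)
    _, _, vt = np.linalg.svd(Z, full_matrices=False)
    u = vt[0]  # top right singular vector of Z
    return u if u.sum() >= 0 else -u  # fix the arbitrary sign

rng = np.random.default_rng(1)
block = rng.random((7, 5))
u = eigen_behavior(block)

# Cross-check against the explicit eigendecomposition of C.
C = np.corrcoef(block, rowvar=False)
_, vecs = np.linalg.eigh(C)  # eigenvalues in ascending order
agreement = abs(u @ vecs[:, -1])
print(agreement)
```

The two vectors agree up to sign, while the SVD route only touches the window x N matrix. 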


Metric to Score Time Points for ‘Event’ 

After finding the eigenvectors for all the 177 C matrices, the change-point in the 
‘eigen-behavior’ of the nodes is found as follows. For the eigenvector computed at time t, denoted 
by u(t), we compute a ‘typical eigen-behavior’ denoted by r(t − 1) from the previous W' 
eigenvectors (bottom right in the figure). We experimented with two different ways to compute 
this typical eigen-behavior. Firstly, we simply took the arithmetic average of all previous 
W' eigenvectors. Secondly, we constructed a new N x W' matrix, each column being an 
eigenvector over that window, and then we computed the left singular vector of that new matrix 
using SVD. Similar to the principal eigenvector for square positive matrices, the left 
singular vector of a positive non-square matrix yields an average ‘behavior’ score for all nodes. 
One can think of the left singular vector as the weighted average of the eigenvectors in W'. 

Finally, after we obtain the ‘typical eigen-behavior’ for each C matrix (for each week) using 
either SVD or regular averaging, the ‘eigen-behavior’ u(t) computed at time t is compared to 
the ‘typical eigen-behavior’ r(t − 1) by taking the dot-product of those two unit vectors. The 
change metric we used is Z = 1 − u^T r. Here, if u(t) is perpendicular to r(t − 1), then their 
dot-product gives a value of 0, or Z = 1, whereas if u(t) is exactly the same as r(t − 1), then 
their dot-product gives a value of 1, or Z = 0. Therefore, Z takes values between 0 and 1, and a 
higher value of Z indicates a change-point (bottom left in the figure). 
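
The change metric and the two ways of summarizing the recent past can be sketched as below. 
Function names are hypothetical, the past eigen-behaviors are assumed to be stored as the 
columns of an N x W' array, and the averaged vector is re-normalized to unit length so that the 
dot-product is taken between unit vectors. 

```python
import numpy as np

def typical_behavior_avg(past):
    """Unit-norm arithmetic average of the past eigen-behaviors (columns)."""
    r = past.mean(axis=1)
    return r / np.linalg.norm(r)

def typical_behavior_svd(past):
    """Top left singular vector of the N x W' matrix of past eigen-behaviors."""
    left, _, _ = np.linalg.svd(past, full_matrices=False)
    r = left[:, 0]
    return r if r.sum() >= 0 else -r  # fix the arbitrary sign

def change_score(u_t, r_prev):
    """Z = 1 - u^T r for unit vectors u(t) and r(t - 1)."""
    return 1.0 - float(u_t @ r_prev)

# Toy example: three identical past eigen-behaviors over N = 3 nodes.
e = np.array([0.6, 0.8, 0.0])
past = np.column_stack([e, e, e])
r_avg = typical_behavior_avg(past)
r_svd = typical_behavior_svd(past)
z_same = change_score(e, r_avg)                            # unchanged behavior
z_orth = change_score(np.array([0.0, 0.0, 1.0]), r_avg)    # perpendicular behavior
print(z_same, z_orth)
```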

12.2 Experimental Results 


We start our analyses by looking at the distribution of correlation values C_{x,y} in the C matrices. 
Figure 12.2 shows the histogram as well as the CDF of C_{x,y} values for two different days, Dec. 
1 and Dec. 26, for feature F: in-weight. 

Figure 12.2: (top) Histogram and (bottom) CDF distribution of correlation scores C_{x,y} for (left) 
Dec. 1 and (right) Dec. 26 using F: in-weight. 


Here, one observation is that the distribution of correlations between time-series of nodes is 
skewed, as might be expected. Surprisingly, though, it is skewed towards large values. That is, 
there are lots of pairs with correlation score close to or equal to 1. This happens because over 
the time window W of 7 days, most of the nodes have no activity: their W-length vectors are all 
0's and thus the pair-wise correlations of such 0 vectors are computed to be 1. This suggests that 
the nodes have bursty activity, where nodes have no activity for several weeks and then have 
activity in bursts. 

Another observation is that the total number of correlation scores equal to 1 decreases on Dec. 26 
compared to Dec. 1, suggesting that fewer nodes have no-activity weeks, that is, more nodes 
become active during the week of Dec. 26. This is expected as this week is the New Year week. 
We note that the CDF distributions for these two days also look different. These observations 
strengthen our belief that studying correlations between behaviors of nodes is important in 
detecting the change-points in our data. 

Next, we compare the results when using SVD versus taking the regular average (AVG) for 
computing the ‘typical eigen-behavior’ r(t − 1) of earlier eigenvectors over a window of W' 
(see Section 12.1.2). Figure 12.3 shows the so-called Z scores computed (1) when r(t − 1) 
is computed with SVD (in blue bars) versus (2) when r(t − 1) is computed by simply taking 
the average (in red lines) for four different values of W' (from left to right, top to bottom): 
W' = 5, 7, 20, 50. Notice that the red line almost exactly follows the blue bars. This means that 
SVD gives roughly equal weight (importance) to all W' past eigenvectors, just as AVG 
does. Therefore, since computing the average is less expensive, we will use AVG to compute 
r(t − 1) in the rest of our experiments. 

We also note that the Z scores follow a somewhat similar trend when different window sizes are 
considered. However, the larger the window gets, the more aggregated the results become (notice 
that the four spikes for W' = 5, 7 reduce to only two spikes for W' = 20, 50). In the rest of our 
experiments, we use W' = 5 over which the r(t − 1) vector is computed by using AVG. 

Figure 12.3: Z scores computed when the typical eigen-behavior vector r(t − 1) is computed by 
taking the SVD (blue bars) versus the regular AVG (red lines) for (from left to right, 
top to bottom) W' = 5, 7, 10, 20. Notice AVG is very similar to SVD. 


12.2.1 Detected Change-Points 

After computing the deviation scores Z as explained in Section 12.1.2, we use a simple heuristic 
to flag the high Z scores. Rather than using a threshold value, we simply compute the difference 
between two consecutive Z scores and rank the time points according to |Z(t) − Z(t − 1)|. 
Figure 12.4 shows the top 10 time ticks for which the difference score is the highest. Here, 
feature F is taken to be the “inweight”. Experiments with other features such as “number of 
reciprocal edges” and “outdegree” also flag similar time points, which we will discuss later in 
this section. 
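
The ranking heuristic amounts to a few lines of code. The sketch below uses a hypothetical 
function name and made-up Z scores; ties are broken by the earlier time tick, which is an 
assumption. 

```python
def top_change_points(z_scores, k):
    """Indices t with the k largest jumps |Z(t) - Z(t - 1)|."""
    diffs = [(abs(z_scores[t] - z_scores[t - 1]), t)
             for t in range(1, len(z_scores))]
    diffs.sort(key=lambda d: (-d[0], d[1]))  # largest jump first
    return [t for _, t in diffs[:k]]

# Hypothetical Z scores with a spike at t = 2 and a smaller rise at t = 5.
z = [0.10, 0.12, 0.80, 0.15, 0.14, 0.50]
top = top_change_points(z, k=3)
print(top)
```

Because the score is an absolute difference, both the onset of a spike (t = 2) and the return 
to normal right after it (t = 3) rank highly. 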

In Figure 12.4, we observe that the top 2 time periods correspond to the weeks of Christmas and 
New Year (Dec 26, Jan 2). This shows that even though our data comes from India and most 
people are not Christian, they would be “celebrating” Christmas. The reason that Jan 2nd 
rather than Jan 1st is flagged is that it is a change-point at which things went back to normal. 

Figure 12.4: Top 10 time points with highest Z-scores flagged by our method (red bars) for 
feature F: in-weight. Numbers on bars indicate rank of each day by Z-score. 

Another surprising finding is the 3rd time tick, which is Apr 7th. Similar to Jan 2nd, this is 
also a time-point where things turned back to normal. The actual interesting day here is 
Apr 6th: http://www.infoplease.com/ipa/A0777465.html lists Apr 6th as the 
“Hindi New Year” (our data is in 2008). These results suggest that our method is effective in 
finding points in time at which the collective behavior of nodes deviates from the recent past. 

As a sanity check, we ran our method on other features such as number of reciprocal edges and 
outdegree. Figure 12.5 shows that our method flags almost the same time points, including Jan 
2nd and April 7th, also with these features. Moreover, the difference/spike in the Z score is 
even clearer with these features. This is intuitive in the sense that even though the “inweight” 
(number of SMSs received) is expected to increase on days such as Christmas and New Year, the 
number of reciprocated interactions is expected to increase even more (people tend to reply to 
celebration messages on such days). 

We also compared the results of our method to the results we obtain by using the sheer volume 
at each time tick. In particular, we computed the total number of SMSs received (in-weight) per 
day and marked the top k = 10 time points at which the most SMSs were received 
in total. We repeated the same process for total number of reciprocal replies (numrecip) and total 
number of out-going contacts (out-degree). 

We show the results in Figure 12.6 where the red bars depict the top 10 time points with highest 
volume. We observe that the total volume alone, for all three features, was enough to detect change 
on, for example, Dec. 31 and Jan 1. However, the points reported for each feature 
also partly differ from each other and are not as consistent as the earlier results. For instance, 
April 6 was detected as a change-point using features ‘numrecip’ and ‘out-degree’, 
but was not detected using ‘in-weight’. 

Figure 12.5: Top 10 time points with highest Z-scores flagged by our method (red bars) for 
(left) F: number of reciprocal edges and (right) F: outdegree. Notice the flagged 
time points are similar to those using F: inweight in Figure 12.4. 

The main reason behind these observations is that our method considers every person in the network 
individually and flags change-points if the majority of them change their ‘normal’ behavior, 
whereas the total volume considers the aggregated behavior. The aggregated data loses information 
at the individual level and is thus prone to flag change-points even if only a few people change 
their behavior by a large amount. 

Figure 12.6: Top time points flagged by using total volume (red bars). 


12.2.2 Attribution of Change 

Here the question is: for a given change-point detected in the previous section, can we go back 
and detect which node(s) contributed to the change the most? 

Figure 12.7 shows the scatter plot of the eigen-scores u(t) versus the typical pattern 
scores r(t - 1) for all the nodes on December 26th. Here, we observe that most of the values 
lie on the diagonal, which shows that a majority of the nodes did not deviate much from their 
typical behavior. On the other hand, some points that are far off-diagonal (marked with red stars) 
contribute to the Z score the most. 


Figure 12.7: Scatter plot of u(t) versus r(t - 1) of nodes on December 26th (panel title: 
t=26-DEC, W=5). Each blue dot depicts a node. Nodes far away from the diagonal change 
in “behavior” the most (top 5 marked with red stars). 


Similarly, Figure 12.8 shows the change ratio (%) between u(t) and r(t - 1) for 10K nodes. Again, 
the same top 5 nodes as in Figure 12.7 are marked in red. 

Figure 12.8: Change ratios (%) of top 10K nodes in u(t) and r(t - 1). Each bar depicts a node 
(the top 5 with highest change ratio are shown in red). 


Since the data does not contain ground truth of anomalies, in Figure 12.9 we plot the time series 
(inweights versus days) of the top 5 nodes marked in Figures 12.7 and 12.8 (one row per 
node). Here we observe that three of the nodes (rows 1, 4 and 5) have no activity in the week 
of Dec 26th. They are flagged because they are observed to have some activity over the previous 
weeks. On the other hand, the other two nodes (rows 2 and 3) show the opposite behavior: they 
start receiving SMSs during Christmas. We also observe that these two sets of nodes lie in 
different halves of the diagonal in Figure 12.7, also indicating an opposite change in their 
behaviors. 


Figure 12.9: Time series of inweight values of top 5 nodes with highest deviation in 
“eigenbehavior” marked in Figures 12.7 and 12.8 (panels: nodeind=3785, 6763, 7054, 8051 
and 5535, each for t=26 “26-DEC”, W=5; x-axis: time). Beginning and end of week December 
26th are marked with red and green vertical bars on the time line, respectively. 


12.3 Summary of contributions 


In this chapter, we introduced an algorithm based on ‘eigen-behavior’ analysis to (1) spot 
‘change-points’ in time at which the majority of the nodes in a given network deviate from their normal 
behavior; and (2) do ‘attribution’ to point out the specific nodes that are most related to the cause 
of a detected change-point. 

We validated the effectiveness of our method on a network dataset of millions of mobile phone 
users and their SMS interactions over half a year. Although there exists no ground truth information 
in the SMS data we analyzed, the experimental results suggest that our method is able to detect 
interesting time points. 

Our contributions are summarized as follows: 

• We used an “eigenbehavior”-based method on the time-series of users and considered the 
amount of change in their “eigenbehaviors” to flag change-points in time. 

• Our method can also be reverse-engineered to spot the top users who contribute to the 
changes the most. 

• Realistic anomaly detection is difficult with unlabeled data, but our results have demonstrated 
that we were able to detect events that coincide with major holidays and festivals in 
our data, as well as users whose change of behavior is evident during these events. 


Chapter 13 

Sensemaking for Graph Anomalies 


PROBLEM STATEMENT: How can we use the network structure to summarize a set of (anomalous) 
nodes, by partitioning them such that for each part we have a simple tour connecting the 
grouped nodes, while nodes in different parts are not easily reachable? In other words, how can 
we summarize the set of (anomalous) nodes for easy sensemaking? 


We motivate the above questions under the anomaly summarization setting. That is, given k 
nodes marked by an anomaly detection algorithm, we could group them to reveal associations and 
connectors, instead of simply listing them. This provides a better understanding (sensemaking) 
of how these nodes are correlated within the graph: whether they are ‘close’ to each other, and 
what pathways or other connectors play a crucial role in their associations. 

While in this chapter we focus on the sensemaking-for-anomalies scenario, these questions find 
applications in numerous other settings. Examples include the following. 

• Given a gene interaction network, an experiment may reveal for particular conditions a 
number of genes to be up (or down) regulated [Liekens et al., 2011], how can we partition 
them with respect to their closeness in the graph? This would give us a good summary 
of groups of close-by correlated genes as well as possible pathways the genes may be 
involved in. 

• Given k nodes marked by an anomaly detection algorithm, how can we explain the anomalies? 
Instead of simply listing them, we could group them and reveal their associations 
within groups. 

• Given an event affecting a set of nodes in a graph (e.g. people affected by a certain disease, 
people buying a particular product), how can we group the nodes such that the network 
structure can be associated with the spread of the event within groups but not quite across 
groups? 

• Given a social graph like Facebook and user attributes of interest (e.g. white girls and black 
boys in the same school), how can we explain their relations using the graph structure? The 
partitions, if any, and connection paths among the chosen students may be of interest to 
segregation studies [Shrum et al., 1988] in social sciences among others. 

• Given the Web graph and a set of top ranked pages (nodes) returned by some keyword 
search, how can we group these nodes matching the keywords into groups and reveal connection 
paths among them, rather than simply listing them? 

Intuitively, we want to partition the given set of marked nodes such that the nodes within a part 
are ‘close-by’ while nodes across parts are far apart. In addition, for the ‘close-by’ nodes in each 
part, we want to find a ‘succinct’ subgraph connecting them. Moreover, we would rather not visit nodes 
of very high degree, such as hubs in social networks: because those nodes connect to virtually 
everything in the graph, they do not provide much information by association. Therefore, we 
think of such high degree nodes as separators, and try to avoid including them in our ‘succinct’ 
subgraphs. 

For example, consider Figure 13.1. In (a), a list of 20 authors from DBLP are marked. In this 
plot, there is no information (other than author names) that explains any correlation among the 
authors. In (b), the marked nodes are projected to and highlighted in the co-authorship graph. In 
contrast, here it is hard to observe any patterns as there is information overload. We show our 
result in (c), which explains the marked nodes with two well-separated groups (islands) as well 
as revealing simple connections and connectors (bridges) among all the marked nodes. 

We formalize this problem in terms of the Minimum Description Length principle [Grünwald, 
2007]: a tour (collection of paths) is simple when we need few bits to direct the user from one 
node to the other. Hence we typically do not want to visit nodes of high degree, as it is more 
expensive to identify which edge to follow leading from them. Similarly, we require more bits if 
we have to visit many unmarked nodes in order to arrive at the next marked node. As such, the 
best tour, i.e. the summary, is the one for which we need the least number of bits to identify all 
marked nodes. 

We show this problem is NP-hard, and has connections to well-known problems in network theory. 
We discuss a number of approximate methods for finding a partitioning and its respective 
simple paths, and introduce Dot2Dot, an efficient algorithm for summarizing marked nodes. 
Experimentation shows Dot2Dot correctly groups nodes for which simple paths can be constructed, 
while separating distant nodes. 


13.1 Dot2Dot: Method Description 

13.1.1 Problem Definition 

Given a graph $G = (V, E)$ and a set of marked nodes $M \subseteq V$, we consider the following two 
related problems. 

• Problem 1. Optimal partitioning: Find a ‘coherent’ partitioning $P$ of $M$. Find the optimal 
number of partitions $|P|$ automatically. 

• Problem 2. Optimal connection subgraphs: Find the ‘minimum cost’ set of subgraphs 
that connect the nodes in each part $p_i \in P$ efficiently. 

Figure 13.1: 20 chosen authors from DBLP. Edges denote co-authorship relations. 

Figure 13.2: Toy graph with 8 marked (black) nodes. Our Dot2Dot algorithm automatically 
‘describes’ them in 3 groups, discovering ‘missing connectors’ (nodes ‘4’ and ‘9’). 


The two problems we consider above involve three sub-problems: 

• (sub-problem 1) how to define the ‘coherence’ of a set of nodes, 

• (sub-problem 2) how to define the ‘cost’ for a connection subgraph, and 

• (sub-problem 3) how to find the connection subgraph(s) quickly in large graphs. 

As an example, let us consider Figure 13.2. In it, we depict a simple graph in which 8 nodes have 
been marked. It is clear that, given the graph structure, the marked nodes naturally form 
three groups: $p_1 = \{1, 2, 3\}$, $p_2 = \{10\}$, and $p_3 = \{5, 6, 7, 8\}$, which are all well separated by 
the big star-node in the middle. While only nodes 5 and 6 have a direct connection, for each part 
there exists a connection subgraph, here highlighted with bold edges, such that we can construct 
a simple dot-to-dot ‘road map’ of which branches to follow in order to visit all marked nodes in a 
part, without having to visit too many unmarked connector nodes, like node 4 for $p_1$ and node 
9 for $p_3$. 

Since we are interested in summarizing, that is, in describing the given marked nodes as succinctly 
as possible, we borrow ideas from Information Theory and employ the Minimum Description 
Length (MDL) principle (see Section 8.2 for background on MDL). 


13.1.2 Formulating Tours: Theory and Objective 

The key idea behind our method builds on an encoding scheme, involving one sender and one 
receiver. We assume both the sender and the receiver already know graph G = (V, E ) and only 
the sender knows the set of marked nodes M. The goal of the sender, then, is to come up with an 
encoding scheme to transmit to the receiver the information of which nodes are marked, using 
as few bits as possible. 

One straightforward way to encode the set of marked nodes is to encode their node-ids separately, 
using $\log |V|$ bits each. On the other hand, it might be more efficient to exploit the 
neighborhood information for ‘close-by’ nodes. In the simplest case, for example, if nodes $u$ and 
$v$ are both marked and they are direct neighbors, i.e. $e(u, v) \in E$, then the sender can encode 
$u$'s id and then use only $\log d(u)$ bits to encode $v$, where we assume a canonical order on nodes 
(say by increasing node-id) and $d(u)$ denotes the degree of $u$ in $G$. As such, the sender follows 
a path from one marked node to the other to encode ‘close-by’ nodes. Depending on the graph 
structure, from time to time it might be more efficient to restart encoding from a new node by 
directly providing its id, in case the cost of the path to that node exceeds $\log |V|$ bits. In fact, that 
is exactly what determines the partitioning $P$ of the nodes. 
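The trade-off between hopping and restarting can be made concrete with a small calculation. The sketch below assumes base-2 logarithms (costs in bits) and a path described only by the degrees of the nodes we leave; the helper names `id_bits`, `hop_bits` and `cheaper_to_hop` are illustrative and not part of the thesis.

```python
import math


def id_bits(num_nodes):
    """Bits needed to name a node outright: log |V|."""
    return math.log2(num_nodes)


def hop_bits(degrees_on_path):
    """Bits needed to follow a path: log d(u) for every node u we leave."""
    return sum(math.log2(d) for d in degrees_on_path)


def cheaper_to_hop(degrees_on_path, num_nodes):
    """True if following the path costs fewer bits than restarting with an id."""
    return hop_bits(degrees_on_path) < id_bits(num_nodes)


num_nodes = 1_000_000  # log2 is just under 20 bits
print(cheaper_to_hop([8, 4], num_nodes))     # True: 3 + 2 = 5 bits
print(cheaper_to_hop([2 ** 20], num_nodes))  # False: leaving this hub costs 20 bits
```

The second call shows why hubs act as separators: a single hop out of a very high degree node can already cost more than naming the target directly.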

Simply put, one can imagine the way of encoding discussed above as hopping from node to node 
for encoding close-by nodes and from time to time flying to a completely new node for encoding 
farther nodes until all marked nodes are encoded. This resembles a tour (union of paths) on 
the graph that travels from a marked node to another which succinctly describes the marked set 
(hence the name Dot2Dot). 

In effect, we are after the shortest description for a group of marked nodes $M \subseteq V$ in a graph 
$G = (V, E)$. More generally, the idea is that per part $p_i$ of $P$, we find the easiest/simplest 
subgraph $T$ in $G$ that spans at least all marked nodes in $p_i$. Simplicity of $T$ is determined by the 
number of nodes we visit in this tour, how many unmarked nodes we visit, and in particular how 
easily per visited node we can identify which edge we have to follow next; nodes with (very) high 
degree hence make the path more complex, or, less likely. Also notice that the simplest subgraph 
would in fact be a tree since it would require fewer bits to refer to a node we have already visited 
in our encoding. 

In this section, we describe the cost function for a given partitioning P and the given connection 
trees for each part pi. We use this cost function as our objective function that we aim to minimize 
for model selection. 

More formally, we first transmit the number of partitions $|P|$, for which we need at most $\log |V|$ bits as 
there will be at most $|V|$ marked nodes in total, which in the worst case are all put in separate 
parts. 

$$L(|P|) = \log |V| \qquad (13.1)$$

Then, per part $p_i \in P$, we have a tree $T$ spanning at least the marked nodes of $p_i$. To identify the 
root node of $T$ in $G$ we have to spend $\log |V|$ bits. 

Then, recursively per node $t \in T$, we transmit how many branches $t$ has, denoted by $|t|$. As $t$ 
corresponds to a node $v_t$ in $G$, and $d(v_t)$ gives us the out-degree of $v_t$, we can transmit $|t| \le d(v_t)$ 
in $\log d(v_t)$ bits. On the other hand, since a simple tree would presumably have a small branch-out 
factor, we choose to encode $|t|$ using universal integer encoding [Grünwald, 2007]. This 
encoding specifies that in order to encode a positive integer $n$ we require 
$L_{\mathbb{N}}(n) = \log^* n + \log c$ bits, with $c = \sum_{n \ge 1} 2^{-\log^* n} \approx 2.865064$ and 
$\log^* n = \log n + \log \log n + \cdots$, where the sum includes only the positive terms. So, as $|t|$ 
can be zero (for leaves), we transmit its value in $L_{\mathbb{N}}(|t| + 1)$ bits. 

Next, we identify which out-edges of $v_t$ have to be visited. This we can encode most succinctly by 
assuming a canonical order of all possible subsets of selected edges of that size, and transmitting 
the index of the actual subset. This takes $\log \binom{d(v_t)}{|t|}$ bits. 

Leaf-nodes are easily identified as their number of branches is given as 0. If such is the case, we 
traverse back up the tree until we find a node with an unvisited branch. Once all branches have 
been visited, we stop transmitting the structure of the tree. 

Now that we know which nodes $p_i \subseteq V$ are in our tour, we need to know which of these are 
marked. Let $|T|$ denote the number of nodes in $T$, and $\|T\|$ the number of marked nodes in 
$T$. As the recipient now knows the tree, and hence $|T|$, and $\|T\| \le |T|$, we need $\log |T|$ bits to 
transmit $\|T\|$. Next, we need to identify which $\|T\|$ nodes of $T$ are marked, which we again do 
by a binomial: $\log \binom{|T|}{\|T\|}$. 
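The per-node ingredients of the encoding can be sketched as follows. This is an illustrative sketch only: it assumes base-2 logarithms, uses the constant 2.865064 quoted in the text, and the function names `log_star`, `universal_int_bits` and `branch_bits` are hypothetical.

```python
import math

C0 = 2.865064  # normalizing constant quoted in the text


def log_star(n):
    """log n + log log n + ..., keeping only the positive terms."""
    total = 0.0
    term = math.log2(n)
    while term > 0:
        total += term
        term = math.log2(term)
    return total


def universal_int_bits(n):
    """Universal code length for an integer n >= 1."""
    if n < 1:
        raise ValueError("the universal code is defined for n >= 1")
    return log_star(n) + math.log2(C0)


def branch_bits(degree, branches):
    """Bits to send the branch count |t| and which out-edges are followed."""
    return universal_int_bits(branches + 1) + math.log2(math.comb(degree, branches))


# A leaf (0 branches) only pays for the branch count itself.
print(branch_bits(degree=5, branches=0) == universal_int_bits(1))  # True
# Following 1 of 4 edges adds log2(4) = 2 bits for the edge choice.
print(branch_bits(degree=4, branches=1) - universal_int_bits(2))   # 2.0
```

The same `math.comb` term also gives the cost of marking nodes within a tree, by calling it with the tree size and the number of marked nodes.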

As such, for one part $p_i \in P$, we have 

$$L(p_i) = \log |V| + L(t) + \log |T| + \log \binom{|T|}{\|T\|} \qquad (13.2)$$

in which $L(t)$ is evaluated at the root of $T$ and, per node $t$ in the tree of $p_i$, 

$$L(t) = L_{\mathbb{N}}(|t| + 1) + \log \binom{d(v_t)}{|t|} + \sum_{j=1}^{|t|} L(b(t, j))$$

where $b(t, j)$ identifies the node $t'$ in $T$ we reach from node $t$ by descending branch $j$ (notice that 
by its recursive nature, $L(t)$ encodes the branching cost for all tree-nodes). 

Putting this together, we get 

$$L(P, M \mid G) = L(|P|) + \sum_{i} L(p_i). \qquad (13.3)$$


Note that our initial assumption that both the sender and the receiver know $G$ does not affect 
model selection: as $G$ is constant for all possible sets of marked nodes on $G$, explicitly transmitting 
$G$ would only add a constant cost for all possible models under consideration, and hence not 
influence measuring the quality of a model. 

Given the above formalization, we now only have to find the optimal partitioning $P$ of $M$, and 
the optimal tours per part $p_i \in P$ such that $L(P, M \mid G)$ is minimized. 
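To see how the terms of the objective combine, the following sketch evaluates the total cost for a hand-built example. It assumes base-2 logarithms and represents each part as a hypothetical `(root, children)` pair, where `children` maps a tree node to its list of child nodes; none of these names come from the thesis.

```python
import math

C0 = 2.865064  # normalizing constant of the universal integer code


def universal_int_bits(n):
    """log* n + log C0, summing only the positive log terms."""
    total, term = 0.0, math.log2(n)
    while term > 0:
        total += term
        term = math.log2(term)
    return total + math.log2(C0)


def tree_bits(node, children, degree):
    """Recursive branching cost L(t) for the subtree rooted at node."""
    kids = children.get(node, [])
    bits = universal_int_bits(len(kids) + 1)
    bits += math.log2(math.comb(degree[node], len(kids)))
    return bits + sum(tree_bits(c, children, degree) for c in kids)


def tree_nodes(node, children):
    """All nodes of the subtree rooted at node."""
    found = [node]
    for c in children.get(node, []):
        found.extend(tree_nodes(c, children))
    return found


def part_bits(root, children, degree, marked, num_nodes):
    """Cost of one part: root id, tree structure, and which nodes are marked."""
    nodes = tree_nodes(root, children)
    n_marked = sum(1 for v in nodes if v in marked)
    return (math.log2(num_nodes) + tree_bits(root, children, degree)
            + math.log2(len(nodes)) + math.log2(math.comb(len(nodes), n_marked)))


def total_bits(parts, degree, marked, num_nodes):
    """Total cost: number of parts plus the cost of every part."""
    return math.log2(num_nodes) + sum(
        part_bits(root, ch, degree, marked, num_nodes) for root, ch in parts)


# Path a - b - c inside a graph of 1024 nodes; a and c are marked.
degree = {"a": 1, "b": 2, "c": 1}
marked = {"a", "c"}
one_part = [("a", {"a": ["b"], "b": ["c"]})]
two_parts = [("a", {}), ("c", {})]
print(total_bits(one_part, degree, marked, 1024)
      < total_bits(two_parts, degree, marked, 1024))  # True
```

In this toy case, walking through the unmarked connector `b` is cheaper than paying for a second node id, so the single-part description wins.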


13.1.3 NP-hardness 

The total encoding cost $L(P, M \mid G)$ can score a given set of solutions and point to the best 
among them; however, it does not provide direct means to find the optimal solution. 

In this section, we show that this problem is NP-hard, by a reduction from the well-known Steiner 
tree problem: given an undirected unweighted graph $G = (V, E)$ and a subset of nodes $X \subseteq V$, 
the objective is to find the minimum cost tree that spans all the nodes in $X$. The cost of a tree is 
defined as the total number of edges in the tree (equivalently, the number of nodes minus one). Note that the tree 
may include nodes in $X$, as well as other nodes which are typically referred to as Steiner nodes. 
Steiner trees are well-known in graph theory, and find application in multicast models for finding 
good server nodes in computer networks for avoiding network congestion and reducing latency 
in multi-user streaming data settings. 


Theorem 3 Minimizing $L(P, M \mid G)$ is NP-hard. 

Proof Let $G = (V, E)$ be a graph and let $M \subseteq V$ be the marked nodes. Let $k = |V|$ be the number 
of nodes. We will construct a graph $G' = (V', E')$ such that solving the Steiner tree problem for 
$G$ reduces to finding a partition with optimal cost in $G'$. 

In order to do that, let $d$ be the maximal degree of a node in $G$. We are now ready to define the 
graph $G'$. The node set $V'$ consists of the nodes $V$ and, for each edge $e \in E$, $s$ auxiliary 
nodes; we will define $s$ later. In addition, we add a large number, say $r$, of isolated nodes; 
we will specify $r$ later. For each edge $e(v, w) \in E$ we add a path of $s + 2$ nodes connecting 
$v$ and $w$ through the $s$ nodes that were created specifically for this edge. We will refer to these paths as 
auxiliary paths. 

Note that auxiliary nodes always have a degree of 2 and they will never be leaves of trees corresponding 
to the optimal partition. That is, either the whole auxiliary path is in the partition or 
none of the nodes in that path is included. 

The cost of having a non-root auxiliary node in a partition is $c = L_{\mathbb{N}}(2) + \log \binom{2}{1}$. If an auxiliary node 
is a root, then the cost is $c_r = L_{\mathbb{N}}(3) + \log \binom{2}{2}$. 

The encoding of one part is equal to 

$$\log |V'| + ncs + corr + \log |T| + \log \binom{|T|}{\|T\|} + \sum_{t \in T,\; v_t \in V} \left( L_{\mathbb{N}}(|t| + 1) + \log \binom{d(v_t)}{|t|} \right)$$

where $n$ is the number of auxiliary paths in $T$. The correction term $corr$ takes into account the 
possibility that an auxiliary node may be the root. If this is the case, then $corr = c_r - c$, otherwise 
$corr = 0$. We can bound the last 4 terms by 

$$b = \log k + k \log k + k \left( L_{\mathbb{N}}(d + 1) + d \log d \right).$$

We can now bound the cost by 

$$\log |V'| + ncs + c_r - c + b.$$

Let us define $s = (c_r + c + b)/c$. It is easy to see that the smaller $n$ is, the better the code for 
the tree. 

Our last step is to make sure that we will have only one part in our partition. In order to do that we 
need to make sure that $|V'|$ is so large that it is not beneficial to have more than one part. To 
this end, we set $r = 2^{kcs + c_r + c + b}$, which guarantees that we will have only one part. 

The optimal partition in $G'$ will now have only one part and will have a minimal number of auxiliary 
paths. Since each path corresponds to an edge in $G$, we can transform this to a Steiner tree 
in $G$ with a minimal number of edges. This proves the theorem. □ 


13.2 Finding Good Paths: Methods 

Since minimizing $L(P, M \mid G)$ is NP-hard, we are interested in fast approximations. To find fast 
solutions, we will exploit heuristics for the directed Steiner (d-Steiner) tree problem [Charikar 
et al., 1998]: given a directed weighted graph $G = (V, E)$ and a subset of nodes $X \subseteq V$, the 
objective is to find the minimum cost arborescence that spans all the nodes in $X$, in other words, 
a rooted minimum cost tree that has a directed path from the root to every node in $X$. The cost 
of a tree is defined as the sum of the costs of the edges in the tree. 

The d-Steiner problem is a well-studied combinatorial optimization problem, for which algorithms 
have been proposed that provide approximation guarantees [Charikar et al., 1998; Helvig 
et al., 2001; Melkonian, 2007; Zelikovsky, 1997]. However, while providing fairly good bounds, 
these algorithms all require great amounts of computing power, making them impractical for application 
on large graphs. In this section we therefore propose fast heuristics for large graphs. 

Before we proceed with proposed solutions, we first define how we transform the input graph $G$ 
to a directed weighted graph $G'$, which we use in obtaining a solution. 

Definition 9 Given an undirected unweighted graph $G = (V, E)$ and a set of marked nodes 
$M \subseteq V$, the transformed graph $G' = (V, E')$ is a directed weighted graph in which each edge 
$e(u, v) \in E$ is replaced by two directed edges $e'(u, v) \in E'$ and $e'(v, u) \in E'$, with 
weights $w'(u, v) = \log d(u)$ and $w'(v, u) = \log d(v)$. 
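Definition 9 translates directly into code. The sketch below assumes the undirected graph is given as a dictionary from each node to the set of its neighbours and that logarithms are base 2; the function name `transform` is illustrative.

```python
import math


def transform(adj):
    """Return the directed edge weights w'(u, v) = log2 d(u) of G'.

    adj maps every node to the set of its neighbours in the undirected graph.
    """
    weights = {}
    for u, neighbours in adj.items():
        for v in neighbours:
            # Every out-edge of u has the same cost, set by the degree of u.
            weights[(u, v)] = math.log2(len(neighbours))
    return weights


# A star: node 0 is the hub, nodes 1..4 are leaves.
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
w = transform(star)
print(w[(0, 1)])  # 2.0: leaving the hub means choosing 1 of 4 edges
print(w[(1, 0)])  # 0.0: a leaf has only one edge to follow
```

The asymmetry of the two weights is what makes paths through hubs expensive while paths into them stay cheap.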


13.2.1 Proposed Algorithms 

Given the transformed directed weighted graph $G'$, we want to find the set of trees with minimum 
total cost on the marked nodes. Without loss of generality we assume that the marked nodes will 
be the leaf nodes in the resulting trees (if a marked node is a non-leaf node, we can always add 
its copy and connect it to its copy with a zero-cost edge); hereafter we therefore refer to the 
marked nodes as the terminals. 


Finding Bounded-length Paths 

Most of our proposed methods use short paths between the terminals as a starting point. Therefore, 
we first present an algorithm, called FindBoundedPaths, to find multiple short paths of 
up to a certain length (hence, indirectly, up to a certain cost), and in order of increasing length 
between the terminal nodes. The threshold on the length of the paths is not a parameter; as it 
takes $\log |V|$ bits to start a new partition, we set it to $\log |V|$. 

Procedure 1: FindBoundedPaths(G, T) 

Input: A graph G = (V, E), terminals T ⊆ V 

Output: multiple (asc. length) short paths SP from terminals 

1  lengths ← log(degree(V)) 
2  paths ← ∅, pathcosts ← ∅, SP ← ∅, curlen ← 0 
3  foreach t ∈ T do 
4      paths.add({t}), pathcosts.add(lengths(t)) 
5      curlen ← curlen + min(pathcosts) 
6      while curlen < log(|V|) do 
7          pathcosts ← pathcosts - min(pathcosts) 
8          foreach v s.t. pathcosts(v) = 0 do 
9              path ← paths(v) 
10             foreach n ∈ N_v do 
11                 if n ∉ path then 
12                     SP.add(from: t, to: n, len: curlen, path) 
13                     paths.add({path, n}) 
14                     pathcosts.add(lengths(n)) 
15             paths.remove(v), pathcosts.remove(v) 
16         curlen ← curlen + min(pathcosts) 
17 return SP 


FindBoundedPaths employs a BFS-like expansion starting from each terminal until the 
threshold path length is reached. The paths from the terminal to the nodes encountered over 
this expansion as well as their total lengths are stored. A major advantage of our transformed 
graph formulation for the BFS-like expansion is that the cost of all out-edges for a node v are the 
same and equal to log d(v). As a result we only need to keep the length per node, rather than per 
edge, in our BFS-list. 

The pseudo-code is given in Procedure 1. First, lengths is a vector of size $|V|$ that holds $\log d(v)$ 
for each node $v \in V$ (line 1). Second, paths is a dynamic list that stores all current paths 
during the BFS expansion. Third, pathcosts stores their respective lengths. Last, SP is a structure 
list that stores the paths from a terminal to other nodes encountered in the BFS expansion 
(2). The paths are indexed by their end nodes and stored in increasing order of length. 

In each iteration, we expand the node(s) with the minimum length (7). For each such node $v$ 
(8), we expand to its neighbors $N_v$ (10). We first store the path information to these neighbors 
(12), and then add the new paths from the root to these neighbors to the paths list (13), and their 
respective lengths to the pathcosts list (14) for further expansion. Note that a node can appear 
multiple times in our BFS lists since each expansion is associated with only a single unique path 
(that is, the path from the root in the BFS tree), and a node may belong to multiple paths. We 
continue iterating, i.e. expanding nodes with minimum length (16) until the threshold length is 
reached (6). 

At the end of FindBoundedPaths, we have all the paths from every terminal to all the nodes 
in $G'$ that are within a path of length $\log |V|$, ordered by increasing length. Notice that the expansion 
can be performed completely in parallel for each terminal for speed. 
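The bounded expansion can be sketched compactly with a priority queue. This is an illustrative variant under stated assumptions, not the bookkeeping of Procedure 1: it replaces the subtract-the-minimum scheme with a heap keyed on accumulated cost, uses base-2 logarithms, and expects the graph as a dictionary from node to neighbour list. The name `find_bounded_paths` and the return layout are hypothetical.

```python
import heapq
import math


def find_bounded_paths(adj, terminals):
    """For each terminal, list (cost, path) pairs in order of increasing cost.

    A path's cost is the sum of log2 d(u) over every node u it leaves;
    a path is only extended while its next hop starts below log2 |V|.
    """
    bound = math.log2(len(adj))
    short_paths = {}
    for t in terminals:
        found = []
        if adj[t]:
            # Heap entry: (cost of the next hop's endpoint, path so far).
            heap = [(math.log2(len(adj[t])), (t,))]
        else:
            heap = []
        while heap:
            cost, path = heapq.heappop(heap)
            if cost >= bound:
                break
            for n in sorted(adj[path[-1]]):
                if n in path:
                    continue  # keep paths simple
                found.append((cost, path + (n,)))
                heapq.heappush(heap, (cost + math.log2(len(adj[n])), path + (n,)))
        short_paths[t] = found
    return short_paths


# Path graph 0 - 1 - 2 - 3, so the bound is log2(4) = 2 bits.
path_graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(find_bounded_paths(path_graph, [0])[0])
# [(0.0, (0, 1)), (1.0, (0, 1, 2))]
```

Because every out-edge of a node has the same weight, the heap only needs one entry per partial path, which mirrors the per-node (rather than per-edge) bookkeeping described above. As in the procedure, the same node may be reached by several distinct paths.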

Next, we give fast approximate solutions for finding a low-cost set of trees on the terminals. 


Dot2Dot-ConnectedComponents 

A basic heuristic to partition the terminals is to simply consider the connected components they 
induce on the graph. That is, terminals that are directly connected are put in the same part, and 
all others in separate parts. By definition, all the edges are reciprocated in the transformed graph, 
and therefore the subgraph induced on each part including two or more nodes is not a directed 
acyclic tree (it may as well have cycles of length greater than 2). To find the minimum cost rooted 
directed tree(s) (a.k.a. arborescence), we use the Chu-Liu algorithm [Chu and Liu, 1965]. 
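The partitioning step of this heuristic amounts to a connected-components search restricted to the terminals. A minimal sketch, assuming an adjacency-list dictionary and a hypothetical function name `terminal_components` (the arborescence step is not shown):

```python
def terminal_components(adj, terminals):
    """Group terminals into the connected components they induce."""
    terminals = set(terminals)
    seen, parts = set(), []
    for start in sorted(terminals):
        if start in seen:
            continue
        seen.add(start)
        stack, part = [start], []
        while stack:
            v = stack.pop()
            part.append(v)
            for n in adj[v]:
                # Only edges between two terminals count.
                if n in terminals and n not in seen:
                    seen.add(n)
                    stack.append(n)
        parts.append(sorted(part))
    return parts


# Toy graph: 1 and 2 are linked only via unmarked node 4; 5 and 6 are adjacent.
adj = {1: [4], 2: [4], 4: [1, 2], 5: [6], 6: [5]}
print(terminal_components(adj, [1, 2, 5, 6]))  # [[1], [2], [5, 6]]
```

The example shows the limitation of the heuristic: terminals 1 and 2 end up in separate parts even though a single unmarked connector joins them.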


Dot2Dot-MinArborescence 

Our second method (Algorithm 1) uses the transitive closure graph of the terminals to find a 
minimum arborescence. The transitive closure graph $G_t = (T, E_t)$ consists of the terminals and 
directed edges $e(t_i, t_j) \in E_t$ between terminals having weight equal to the shortest path length 
$w(t_i, t_j)$ from $t_i$ to $t_j$, $1 \le i, j \le |T|$. If the shortest paths between all pairs of terminals are 
of length less than $\log |V|$, then $G_t$ is simply a directed clique graph. As we find paths of length up to 
$\log |V|$ from every terminal in FindBoundedPaths, a terminal does not have an edge 
in $G_t$ to those terminals that are more than this threshold apart from it. 


Algorithm 1: Dot2Dot-MinArborescence 
Input: A graph G = (V, E), terminals T ⊆ V 
Output: Partitions P: a Steiner tree on each p ∈ P 

1 SP ← FindBoundedPaths(G, T) 
2 Construct distance-graph (transitive closure) G_t = (V_t, E_t, d_t), where V_t = T, and for 
  every (v_i, v_j) ∈ E_t, d_t(v_i, v_j) = min len(SP(v_i, v_j)). 
3 Add universal node u which connects to every v_i ∈ V_t with d_t(u, v_i) = log(|V|). 
4 Find minimum arborescence A of G_t. 
5 Delete universal node u and its out-edges from A. 
6 return ExpandPartition(A, SP, T) 


Having constructed the transitive closure graph (2), we add a so-called universal node $u$ with 
directed edges $e(u, t_i)$ to every terminal $t_i$ with weight $\log |V|$ (3). We find the minimum weight 
arborescence $A$ in this new graph (4). Since the universal node $u$ does not have any incoming 
edges, it constitutes the root of $A$. In fact, the number of out-edges of $u$ in $A$ gives us the number 
of parts $|P|$, and its sub-trees constitute the partitioning $P$ (5). Next, we replace the edges in each 
of $u$'s sub-trees with their corresponding paths in the transformed graph $G'$ (6). The expanded 
sub-trees may contain both marked and unmarked nodes. Also notice that the expanded sub-trees 
might no longer be trees but contain cycles. Therefore, we rerun the arborescence algorithm on 
the expanded sub-trees and remove any unmarked leaf nodes in the resulting arborescences, 
which yields the final forest of Steiner trees. The min-arborescence algorithm takes $O(|V|^2)$ for 
dense graphs. As we run it on the closure of marked nodes only, we get $O(|M|^2)$. 
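The effect of the universal node can be illustrated on a tiny closure graph. The sketch below is not the Chu-Liu algorithm: it swaps in an exhaustive search over parent choices, which is only feasible for a handful of terminals, purely to show how the sub-trees of the universal node define the parts. The names `min_arborescence_bruteforce`, `partition_with_universal_node` and the label `"_universal"` are hypothetical.

```python
import itertools
import math


def _reaches_root(v, parent, root):
    """True if following parent pointers from v ends at root without a cycle."""
    seen = set()
    while v != root:
        if v in seen:
            return False
        seen.add(v)
        v = parent[v]
    return True


def min_arborescence_bruteforce(nodes, root, weight):
    """Exhaustive minimum arborescence; weight maps (src, dst) -> cost."""
    others = [v for v in nodes if v != root]
    in_edges = [[e for e in weight if e[1] == v] for v in others]
    best_cost, best_parent = math.inf, None
    # Every non-root node picks exactly one incoming edge.
    for combo in itertools.product(*in_edges):
        parent = {dst: src for src, dst in combo}
        if all(_reaches_root(v, parent, root) for v in others):
            cost = sum(weight[e] for e in combo)
            if cost < best_cost:
                best_cost, best_parent = cost, parent
    return best_cost, best_parent


def partition_with_universal_node(terminals, closure, num_nodes):
    """Partition terminals by the sub-trees hanging off the universal node."""
    root = "_universal"
    weight = dict(closure)
    for t in terminals:
        weight[(root, t)] = math.log2(num_nodes)  # cost of starting a new part
    _, parent = min_arborescence_bruteforce(list(terminals) + [root], root, weight)
    parts = {}
    for t in terminals:
        top = t
        while parent[top] != root:
            top = parent[top]
        parts.setdefault(top, []).append(t)
    return sorted(sorted(p) for p in parts.values())


# a and b are close; c has no bounded-length path to or from the others.
closure = {("a", "b"): 1.0, ("b", "a"): 1.5}
print(partition_with_universal_node(["a", "b", "c"], closure, num_nodes=16))
# [['a', 'b'], ['c']]
```

Here the arborescence attaches `b` below `a` (1 bit) rather than paying 4 bits for a second root edge, while `c` can only be reached from the universal node and so forms its own part.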


Procedure 2: ExpandPartition(R, SP, T) 

Input: tree(s) R, short paths SP, terminals T 
Output: Partitions P: a Steiner tree on each p ∈ P 

1 Construct subgraph(s) R' by replacing each edge in R by its corresponding shortest path. 
2 Find minimum arborescence A of R'. 
3 Construct Steiner tree(s) S from A by deleting edges in A, if necessary, s.t. all leaves in S 
  are terminal (marked) nodes. 
4 return connected components of S as P. 
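Step 3 of Procedure 2 can be sketched as a recursive pruning pass. This is an illustrative sketch, assuming the arborescence is given as a dictionary from each node to its list of children; the function name `prune_unmarked_leaves` and the nested-tuple output format are hypothetical.

```python
def prune_unmarked_leaves(node, children, marked):
    """Return the subtree at node as (node, [subtrees]), or None if it is dropped.

    A node survives if it is marked or if at least one of its subtrees survives,
    so chains of unmarked nodes that lead to no terminal disappear entirely.
    """
    kept = []
    for child in children.get(node, []):
        subtree = prune_unmarked_leaves(child, children, marked)
        if subtree is not None:
            kept.append(subtree)
    if not kept and node not in marked:
        return None
    return (node, kept)


# a -> x -> {b, y} and a -> z, where only a and b are marked.
children = {"a": ["x", "z"], "x": ["b", "y"]}
print(prune_unmarked_leaves("a", children, marked={"a", "b"}))
# ('a', [('x', [('b', [])])])
```

The unmarked connector `x` is kept because it leads to the terminal `b`, whereas the unmarked leaves `y` and `z` are removed.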


Algorithm 2: Dot2Dot-1-LevelTree 

Input: A graph G = (V, E), terminals T ⊆ V 
Output: Partitions P: a Steiner tree on each p ∈ P 

1  SP ← FindBoundedPaths(G, T) 
2  mincost ← ∞ 
3  foreach t ∈ T do 
4      N_t ← {v ∈ T | SP(t, v) exists} 
5      Construct 1-level tree L_1 = (V_p, E_p, d_p), where V_p = t ∪ N_t, and for every 
       (t, v ∈ N_t) ∈ E_p, d_p(t, v) = min(SP(t, v)). 
6      P ← ExpandPartition(L_1, SP, T) 
7      if |V_p| < |T| then 
8          P ← P ∪ Dot2Dot-1-LevelTree(G, T \ V_p, SP) 
9      cost ← cost of partition P by Eq. 13.3 
10     if cost < mincost then 
11         mincost ← cost 
12         minP ← P 
13 return minP 


Dot2Dot- 1-Level Tree 

Our third method (Algorithm 2) builds a (set of) 1-level tree(s) from the transitive closure graph
on the terminals. Simply put, we try each terminal as the root (3) and connect it to the other
terminals with shortest paths on the transformed graph $G'$ (4-5). If the selected root does not
have paths shorter than $\log |V|$ to all the terminals, a (set of) 1-level tree(s) is built on
the remaining terminals (7-8). We return the least-cost (w.r.t. Eq. 13.3) tree(s) as the forest of
Steiner trees (10-12). Note that this heuristic algorithm provides a $|M|$-approximation to the
optimal solution [Charikar et al., 1998].
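
The control flow of Algorithm 2 can be sketched in Python as follows. This is a hedged illustration only. It assumes the bounded shortest-path lengths are available as a nested dictionary `sp`, it skips the ExpandPartition step, and it takes the cost as a caller-supplied function. The `toy_cost` shown here is a hypothetical stand-in and is not Eq. 13.3.

```python
def one_level_forest(terminals, sp, cost_fn):
    """Try each terminal as a root; recurse on terminals the root cannot reach.

    sp[t][v] holds the bounded shortest-path length from t to v and is
    missing when no path within the bound exists (assumed data layout).
    Returns (cost, forest), where forest is a list of (root, {leaf: length}).
    """
    terminals = frozenset(terminals)
    best_cost, best_forest = float("inf"), []
    for root in sorted(terminals):
        # Lines 4-5: star from the root to every terminal it can reach.
        leaves = {v: sp[root][v] for v in terminals
                  if v != root and v in sp.get(root, {})}
        forest = [(root, leaves)]
        # Lines 7-8: cover whatever is left with further 1-level trees.
        rest = terminals - {root} - set(leaves)
        if rest:
            forest += one_level_forest(rest, sp, cost_fn)[1]
        # Lines 10-12: keep the cheapest forest seen so far.
        cost = cost_fn(forest)
        if cost < best_cost:
            best_cost, best_forest = cost, forest
    return best_cost, best_forest


def toy_cost(forest, per_tree_penalty=5):
    """Hypothetical stand-in for Eq. 13.3: path lengths plus a fixed cost per tree."""
    return sum(sum(leaves.values()) + per_tree_penalty for _, leaves in forest)
```

Like the pseudocode, the sketch recurses for every candidate root, so it is only meant for small terminal sets.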


Algorithm 3: Dot2Dot-k-LevelTree

Input: A graph $G = (V, E)$, terminals $T \subseteq V$
Output: Partitions $P$: a Steiner tree on each $p \in P$

1   $SP \leftarrow$ FindBoundedPaths($G$, $T$)
2   Construct candidate-graph $G_c = (V_c, E_c)$: union of all top-3 shortest paths between all
    pairs of terminals.
3   $mincost \leftarrow \infty$
4   foreach $t \in T$ do
5       $L_k \leftarrow \emptyset$
6       $C_1 \leftarrow$ min-cost 1-level tree(s), one rooted at $t$
7       foreach $L_1 \in C_1$ do
8           $S \leftarrow$ leaves($L_1$), $r \leftarrow$ root($L_1$)
9           if $|S| < 2$ then continue
10          while $S \neq \emptyset$ do
11              $L_k^p \leftarrow$ PartialkTree($G_c$, $SP$, $r$, $S$, $k$)
12              $C_k \leftarrow C_k \cup L_k^p$
13              $S \leftarrow S \setminus$ leaves($L_k^p$)
14      $P \leftarrow$ ExpandPartition($L_k$, $SP$, $T$)
15      $cost \leftarrow$ cost of partition $P$ by Eq. 13.3
16      if $cost < mincost$ then
17          $mincost \leftarrow cost$, $minP \leftarrow P$
18  return $minP$


Dot2Dot-k-LevelTree

Our final method generalizes the 1-level tree heuristic to k-level trees. The goal is to start with a
(set of) 1-level tree(s) and successively refine each for lower cost. The main idea is the following:
for a given tree with root r, find one or more intermediate nodes $v \in V$, such that the total cost
from r to each v plus the costs of sub-trees rooted at the v's (each with a disjoint set of terminals as
leaves) is lower than the initial cost.

More specifically, we construct a k-level tree by a union of sub-trees, each consisting of the root
r, exactly one intermediate node v, and all the descendants of this intermediate node. We refer
to such sub-trees of level k as partial k-trees, and we find them in a greedy manner described as
follows. First, we find a partial k-tree $L_k^p$ (if any) that reduces the cost for a subset of terminals
that it spans. Next, we remove the terminals spanned by $L_k^p$ from consideration and iterate
this process until all terminals are spanned. Algorithm 3 formally describes the k-level tree
heuristic.

Procedure 3: PartialkTree($G_c$, $SP$, $r$, $S$, $k$)

Input: A graph $G_c = (V_c, E_c)$, short paths $SP$, root $r$, a set $S$ of nodes, level $k$
Output: partial k-level tree $L_k^p$ with intermediate node $v \in \{V, \emptyset\}$ and leaves($L_k^p$) $\subseteq S$

1   foreach $v \in V_c$ do
2       if $k = 2$ then
3           Sort $S = \{N_1, \ldots, N_{|S|}\}$ such that
            $SP(r, N_i) - SP(v, N_i) \geq SP(r, N_{i+1}) - SP(v, N_{i+1})$
4           Find $j \in \{1, \ldots, |S|\}$ that minimizes $cost(v) \leftarrow SP(r, v) + \sum_{i=1}^{j} SP(v, N_i)$
5           $S(v) \leftarrow \{N_1, \ldots, N_j\}$
6       else
7           $S' \leftarrow S$, $L_k^p \leftarrow (r, v)$, $cost(L_k^p) \leftarrow \infty$
8           while $S' \neq \emptyset$ do
9               $L_{k-1}^p \leftarrow$ PartialkTree($G_c$, $SP$, $v$, $S'$, $k-1$)
10              if $cost(L_k^p) < cost(L_{k-1}^p)$ then exit while
11              $L_k^p \leftarrow L_k^p \cup L_{k-1}^p$
12              $S' \leftarrow S' \setminus$ leaves($L_{k-1}^p$)
13          $cost(v) \leftarrow cost(L_k^p)$, $S(v) \leftarrow$ leaves($L_k^p$)
14  Find $v$ having minimum $cost(v)$
15  return partial k-level tree $L_k^p$ with intermediate node $v$ rooted at $r$ and leaves $S(v)$


To complete the k-level heuristic, we need to describe its subroutine PartialkTree which,
given a root r and a set of terminals S, finds a low-cost partial k-level tree rooted at r and
spanning (a subset of) the terminals. The PartialkTree heuristic, as given in Procedure 3,
is recursive: in order to find a partial k-level tree, we first need to find certain partial (k-1)-level
trees that span all the given terminals (8-12). The base case is reached for level k = 2, which
works as follows. For each candidate v for the intermediate node, we sort the terminals S
according to the potential savings of inserting node v between the root r and each terminal in
S (3). The potential saving for a terminal $t_i$ is the difference between the shortest path
lengths $w(r \to t_i)$ and $w(v \to t_i)$. These savings can take positive and negative values and are
sorted in decreasing order. Finally, we include consecutive terminals from this sorted list while
their inclusion decreases the cost of the partial 2-tree (4-5).
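
The base case k = 2 can be sketched in Python under one explicit assumption about the stopping rule: a terminal is kept while its saving $w(r \to t_i) - w(v \to t_i)$ is positive, and a candidate v is accepted only if the total saving exceeds the extra cost $w(r \to v)$. This is one reading of "while their inclusion decreases the cost" and may differ from the exact criterion in Procedure 3. The function name and the nested-dictionary layout of `sp` are hypothetical.

```python
def best_partial_2tree(root, candidates, terminals, sp):
    """Pick the intermediate node v and terminal subset with the largest net saving.

    sp[u][w] is the shortest-path length from u to w (missing if unknown).
    Returns (v, chosen_terminals, net_saving), or None when no candidate helps.
    """
    best = None
    for v in candidates:
        if v == root or v not in sp.get(root, {}):
            continue
        # Saving of routing terminal n through v instead of directly from the root.
        savings = [(sp[root][n] - sp[v][n], n) for n in terminals
                   if n in sp.get(root, {}) and n in sp.get(v, {})]
        savings.sort(key=lambda pair: pair[0], reverse=True)
        # Take the prefix of the sorted list whose members still lower the cost.
        chosen = [n for gain, n in savings if gain > 0]
        net = sum(gain for gain, _ in savings if gain > 0) - sp[root][v]
        if chosen and net > 0 and (best is None or net > best[2]):
            best = (v, chosen, net)
    return best
```

Because the savings are sorted in decreasing order, the terminals with positive saving always form a prefix of the sorted list, which matches the "consecutive terminals" wording above.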


13.2.2 Discussion 

Before we conclude this section, we discuss some aspects of the k-level tree heuristic.

Firstly, note that in order to find good intermediate nodes v, Procedure PartialkTree considers
all the nodes in the given input graph (1). This raises two issues: 1) we would need the
shortest paths from the root r to every v as well as from v to its descendants, and 2) the iteration
would become computationally infeasible even for moderate size graphs. As a result, we provide
PartialkTree with the so-called candidate graph $G_c = (V_c, E_c)$. The candidate graph
consists of the union of the nodes and edges in all top-$\ell$ shortest paths between every pair of
terminals. Since $G_c$ is much smaller than $G'$ and as we have the shortest paths to the nodes in it
computed, both issues are addressed. In the experiments we use $\ell = 3$, which can be adjusted
depending on available computational resources. Considering a larger candidate graph may help
find lower cost trees, while a smaller candidate graph provides lower running time as well as
better visualization.

Secondly, notice that the cost of a partial 2-tree is computed by the sum of the shortest paths from
root r to intermediate node v, and from v to a set of terminals (4). Recall that FindBoundedPaths
finds short paths from marked nodes to the nodes within $\log |V|$ distance. Since v
might be an unmarked node, we might need the shortest path length from an unmarked node to
a marked one. We find the shortest path from unmarked nodes in the candidate graph to marked
nodes using the following Lemma.

Lemma 6 The shortest path from i to j and the shortest path from j to i, for all $i, j \in V$ in the
transformed graph $G'$, contain exactly the same nodes with edges in reverse directions. Due to
symmetry, the total weights of the reciprocal shortest paths obey the following equation.

$$w(i \to j) = w(j \to i) - \log d(j) + \log d(i)$$

Proof Let $\{i \to v_1 \to v_2 \to \cdots \to v_h \to j\}$ denote the shortest path from i to j. Notice that any
path from i to j contains an out-edge of i to one of its neighbors, with weight $\log d(i)$. Thus,
$w(i, v_1) = \log d(i)$. Let $R = \sum_{k=1}^{h-1} w(v_k, v_{k+1}) + w(v_h, j)$, such that $\log d(i) + R$ minimizes
the total length of the path. Since all out-edges of a node $v \in V$ have equal weight, we can also write
$w(v_1, i) + \sum_{k=2}^{h} w(v_k, v_{k-1}) = R$. So the reverse path from j to i has length $\log d(j) + R$. As R
gives the shortest path length from i to j, it also gives the shortest from j to i. Hence the lemma
holds. ∎

As a result, given that we know the shortest paths from terminals to other nodes $v \in V$ within
distance $\log |V|$, we can compute the shortest path from any such v to any terminal in constant time using
Lemma 6. On the other hand, notice that for $k \geq 3$ in PartialkTree, a descendant of an
unmarked intermediate node might also be unmarked, in which case we would need the shortest
paths between two unmarked nodes and run FindBoundedPaths starting from every new set
of intermediate nodes. In our experiments, for speed we restrict ourselves to k = 2.
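
Lemma 6 can be checked numerically on a small graph. The sketch below assumes, as in the proof, that every out-edge of a node u in the transformed graph has weight $\log d(u)$. The base-2 logarithm, the function names, and the adjacency-list input format are illustrative assumptions. The lemma itself does not depend on the base.

```python
import heapq
import math


def transformed_dijkstra(adj, source):
    """Shortest-path weights from source when each out-edge of u costs log2 d(u)."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        step = math.log2(len(adj[u]))  # same weight on every out-edge of u
        for v in adj[u]:
            nd = d + step
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def reverse_weight(w_ji, deg_j, deg_i):
    """Lemma 6: w(i -> j) = w(j -> i) - log d(j) + log d(i)."""
    return w_ji - math.log2(deg_j) + math.log2(deg_i)
```

In use, `transformed_dijkstra` would be run from each terminal only, and `reverse_weight` would then give the weight from any reached node back to that terminal without a second search.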


13.3 Experimental Results 


In this section, we evaluate our proposed method: first we give intuitive results on synthetic examples,
second we quantitatively compare the performance of the four proposed heuristic methods
Dot2Dot-* (Components, MinArborescence, 1-, and 2-LevelTree), and finally we
provide case studies to show qualitative performance on real-world graphs.

13.3.1 Synthetic examples


We start by testing our method through three examples on a synthetic 100 x 100 grid graph, as
well as one example on the well-known Zachary's karate graph [Zachary, 1977].

First, as shown in Figure 13.3(a), we select 8 marked (blue square) nodes relatively close to each
other on the grid graph, and run all four of our approximation methods. We highlight (by orange
bold edges) the minimum description cost tree found (orange nodes: connectors); notice that it
provides succinct connections among the marked nodes. Next we place 4 of the marked nodes
farther apart in the grid graph, in which case our method successfully partitions the marked nodes
into 2 parts and provides a connection tree for each, as shown in Figure 13.3(b).








Figure 13.3: Synthetic examples: (a) 8 marked (square) nodes placed close to each other on a 
grid, (b) same 8 nodes in (a) spaced out on the grid, forming 2 parts, (c) connecting 
the dots, (d) recovering missing connector, (e) 7 marked nodes on Karate graph, 
forming 3 parts. 


Second, we place the marked nodes intermittently on the grid as shown in Figure 13.3(c). Intuitively,
a human would connect these dots to form a rectangle, and so does our method.
Figure 13.3(d) shows a set of marked nodes forming an almost full rectangle on the grid, except one
node in the middle left unmarked. Notice that our method successfully recovers this ‘left-behind’
node as a significant connector.

Finally, we mark 7 nodes on the Karate graph as depicted in Figure 13.3(e). Our method partitions
the nodes into 3 parts that are well separated through high degree hub nodes.

13.3.2 Comparing the heuristics


In this section, we aim to understand the average performance of the four approximation methods 
proposed in Section 13.2. To do so, we run simulations on three real graphs and compare their 
average description cost. In each simulated run we test the heuristics on a different set of marked 
nodes. The set of marked nodes is selected via random walk sampling, which we describe next.
The dataset information is given in Table 13.1. 


Name         |V|     |E|      Description
Netscience   379     914      Author collaborations
GScholar     83K     148K     Academic article citations
DBLP         329K    1094K    Author collaborations

Table 13.1: Dataset summary used for DOT2DOT.


To select a set of k marked nodes, we follow a random walk sampling scheme. First, we fix 
a sampling rate s. Then, we choose and mark a node at random in the given graph. Next we 
randomly visit k' < k of its neighbors, and mark each neighbor with probability s. We continue 
this process until we have k marked nodes in total. 
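
The sampling scheme can be sketched in Python as follows. The text leaves open how the walk proceeds after a round of marking, so this sketch assumes that the walk moves to one of the neighbors it just visited and restarts at a random node on a dead end. The function name, the `max_steps` guard, and the adjacency-list input are hypothetical as well.

```python
import random


def sample_marked_nodes(adj, k, s, k_prime=3, seed=None, max_steps=100_000):
    """Mark k nodes by a random walk that marks visited neighbors with probability s."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    current = rng.choice(nodes)      # the starting node is always marked
    marked = {current}
    for _ in range(max_steps):       # guard against graphs with fewer than k reachable nodes
        if len(marked) >= k:
            break
        neighbors = sorted(adj[current])
        if not neighbors:
            current = rng.choice(nodes)  # dead end: restart elsewhere (assumption)
            continue
        # Visit up to k' neighbors and mark each with probability s.
        visited = rng.sample(neighbors, min(k_prime, len(neighbors)))
        for v in visited:
            if len(marked) < k and rng.random() < s:
                marked.add(v)
        current = rng.choice(visited)    # continue the walk from a visited neighbor (assumption)
    return marked
```

A higher sampling rate s marks more of the immediate neighbors at each step, so the marked nodes end up closer to each other in the graph.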


Figure 13.4: Comparison of our heuristics: total cost (bits) versus various sampling rates s to
choose the marked nodes. Panels include GoogleScholar (k=10, runs=10) and DBLP (k=8, runs=10).


In Figure 13.4, we show average description cost (in bits) versus sampling rate s = {0.1, ..., 0.9}
in our simulations (k' is set to 3; k' = 1, 2 give similar results). We notice comparable performance
results on all three graphs: the simplest heuristic Components provides the costliest description,
while MinArborescence gives the most succinct description among the four methods on average.
In addition, 2-LevelTree provides competitive results to MinArborescence, and
outperforms 1-LevelTree as would be expected.

In addition, notice the downward trend of the cost for increasing sampling rate. This is expected,
as for higher s the marked nodes are chosen among closer nodes. This also explains the relatively
larger gap in performance between 2-LevelTree and MinArborescence for small s, as for
nodes that are farther apart a tree with more than 2 levels might be required.

Finally, we give the running time of our methods in Figure 13.5. We notice that due to the
iterative nature of looking for good intermediate nodes, the k-LevelTree heuristics (L1 and L2,
respectively) take longer than the others. At the same time, L2 completes in about 50 seconds
on our largest graph DBLP and can be further sped up by providing it with a smaller candidate
graph. Since FindBoundedPaths accounts for the largest share of the running time in our framework, we
propose to run all heuristics and report the best result with the minimum cost.


Figure 13.5: Run time of proposed methods (ShortPaths, CC, ARB, L1, L2) on three real graphs
(Netscience, GoogleScholar, DBLP).


13.3.3 Case studies on real graphs 

Our method can provide useful insight about the connections and connectors among a group 
of nodes in a given graph. As such, we develop DOT2DOT as an interactive tool that aids in 
visualization and sensemaking. In this section, we provide qualitative analysis on two large 
real-world graphs. 


Authors in DBLP 

We first employ Dot2Dot among authors from various fields in computer science in the DBLP
dataset. In particular, we select 2 major conferences in certain fields and mark the top 10 authors
from each, that is, those who have the most papers at that particular conference.

In Figure 13.6(a) we show the connection tree our method finds among respective authors from
VLDB and CHI, which are in the fields of databases and human computer interaction, respectively.
Notice that the authors from these two fields are considerably well separated in the tree.
We also observe a connector node among the communities: Duen Horng Chau, a PhD student
at Carnegie Mellon (also a co-author) whose research focuses on bridging data mining, human
computer interaction, and visualization.

We obtain similar results for the authors from RECOMB (computational biology) and KDD
(data mining and machine learning) in Figure 13.6(b). Notice the simple connections among 
the authors within the same field and otherwise well separated communities. The connector is 
David Heckerman, the director of the eScience team at Microsoft Research whose work focuses 
on learning from, and analysis of biological and medical data. 

In the next example from DBLP we look at authors from NIPS (machine learning) and PODS 
(database systems), 5 authors from each. DOT2DOT connects the authors from each field through 
a few connectors as shown in Figure 13.6(c). Notice there are two parts in our partitioning, which
suggests that authors from these two communities are sufficiently apart in the graph.


Articles in GoogleScholar 

We next apply DOT2DOT to summarize academic articles on certain topics from GScholar. 
By their titles, we mark nodes in the citation graph containing specified keywords. 

In Figure 13.7(a) we show 8 marked articles containing both ‘large graphs’ and ‘visual’ keywords
in their title. Our visualization highlights the resulting partitioning on the candidate graph we 
generate by the union of top-3 shortest paths among the marked nodes. We find that Dot2Dot 
partitions the marked nodes into 4 parts, 3 of which are singletons. The connection tree for the 
largest part provides a concise summary for the 5 nodes in this part, with only two connectors. 
Also notice that the singleton nodes are at least three hops apart from this tree. 

In the second example, we mark a random set of the articles with the keywords ‘association rule’ 
together with ‘visual’ and/or ‘text’ in their title. Here, we expect the articles to naturally split into 
two groups, one group on visualizing association rules and the other on association rule mining in 
text data. Our results, as shown in Figure 13.7(b), agree with our intuition: visualization articles 
are linked up through a meaningful connector and ‘text’ articles form separate multiple parts. It 
is interesting to observe that they cannot be glued together by a single connection tree as in the
former scenario.

Figure 13.6: Connection trees among authors from DBLP with the most papers at specified
conferences, (a) connector Duen Horng Chau: bridging data mining and
human-computer interaction, (b) connector David Heckerman: mining biomedical
data, (c) two trees sufficiently apart.

Figure 13.7: Connection trees among articles from GScholar with the specified keywords
in their title, (a) one tree summarizing 5 marked nodes while singletons reside
farther apart, (b) one tree summarizing ‘visualization’ articles while ‘text’ articles
are scattered.

13.4 Summary of contributions


In this chapter, we introduced DOT2DOT¹, a novel method to ‘describe’ a set of marked nodes
in a graph by grouping ‘close-by’ nodes together as well as providing ‘concise’ connections
among the nodes in each group.

We motivated this problem and our solutions under the setting where the marked nodes are
thought of as anomalous nodes reported by an anomaly detection algorithm, like OddBall. On
the other hand, we note that our problem framework is general and can find applications in
various other settings such as gene pathway discovery, product adoption summarization and
understanding of adoption correlation with graph structure, and visualization.

Our main contributions are as follows.

• Problem formulation: We formalize the problem of ‘describing a set of chosen nodes in
a graph’ as an encoding problem. Our formulation exploits the Minimum Description 
Length principle: the best description is the one for which we need the least number of 
bits to encode the marked nodes. As such, our method does not require any user-specified 
parameters such as the number of groups. 

• Fast algorithms: We show that our problem formulation has connections to the directed 
Steiner tree problem [Charikar et al., 1998], and that finding the minimum cost (in bits) 
solution is NP-hard. We propose fast solutions for large graphs. 

• Experiments on real graphs: Experiments on real academic collaboration and citation as 
well as synthetic graphs demonstrate the effectiveness of Dot2Dot in discovering good 
connectors and connections that agree with human intuition. 


¹ Source code of DOT2DOT: www.cs.cmu.edu/~lakoglu/#code

Part IV

Conclusion and Future directions

Chapter 14

Concluding Remarks


To review, we first summarize our major contributions and then give the impact of this thesis. 
We conclude with possible interesting future research directions. 


14.1 Research Summary and Contributions 

In this thesis, we focused on three interrelated aspects of the study of networks: (1) we analyzed 
the structure and dynamics of real-world networks to understand the regularities they exhibit, (2) 
using the understanding of how networks form and evolve we created several generative models, 
and finally (3) we developed new techniques to spot anomalies in network data in various settings. 
For a list of publications this work appeared in, see Appendix A. 


14.1.1 New Patterns in Network Topology and Human Communications 

In Part I, we analyzed numerous large (millions of nodes and edges) real-world networks from 
many diverse domains including political campaign donations, computer network traffic, blog 
citations, social and Web networks, as well as human communication networks. While previous 
work focused on static unweighted networks, we shifted our focus to dynamic and weighted 
networks. We discovered surprising new patterns, which we briefly review here. We refer to §3.3 
and §4.4 for more details. 

• “Rebel probability” and oscillating/constant-size connected components: We are among
the first to focus on next-largest connected components (NLCCs) and their dynamics. In
particular, we showed that (1) NLCCs look like chain graphs; (2) real graphs exhibit a
“gelling” point at which the small components merge and a giant connected component
(GCC) emerges; (3) after the gelling point, secondary and tertiary components cannot grow
beyond a certain size, past which they get absorbed into the GCC; (4) graph fractal dimension
of components is stable over time; and (5) rebel probability (that a newcomer
node will not join the GCC) drops exponentially with its degree and increases linearly
with the fraction-size of NLCCs. The “rebel” probability (microscopic dynamics) explains
the component sizes oscillating around constant-size (macroscopic dynamics).

• “Fortification effect” and other non-linear weighted-network patterns: We are the
first to focus on weights on graph edges and their dynamics. In particular, we showed that
(1) total weight of a node and its degree follow power laws, and weight between two nodes 
follows a gravitational-force-like relation with the total weights of those nodes; (2) total 
weight of a graph grows superlinearly with number of its edges over time, which we call 
fortification; (3) weight additions over time are bursty; and (4) principal eigenvalues and 
number of edges of a graph follow a power law relation over time. 

• Power-laws in human communications: In order to understand human communication
patterns, we analyzed billions of human communication (phone call, SMS, and instant
message) records of millions of users over several months. (1) We found that the number 
of maximal cliques a node participates in follows a power-law relation with its degree; 
which translates to “popularity” growing superlinearly with the number of contacts. (2) 
We proposed the 3PL function, which models the reciprocity of user pairs very well, and 
better than bivariate Pareto and Yule. We observed that reciprocity is higher (a) for mutual 
pairs with larger local network overlap, that is, for people with more common friends; and 
(b) for mutual pairs with larger degree-similarity, that is, for people with similar number 
of contacts. (3) We also proposed the TLAC distribution, which fits a vast majority of 
individual phone call durations the best, much better than lognormal and exponential. 


14.1.2 Realistic Generative Models of Networks 

In Part II, we focused on the problem of how to build a generative model that could produce 
realistic-looking synthetic networks. We presented two generators for general graphs that mimic 
a long list of topological properties of real-world networks. We also developed an agent-based 
generator for human communication graphs that mimics human behavior well. We review these
models briefly next. We refer to §6.3 and §7.2 for more details. 

• Realistic generative models for graphs: Our first model Recursive Tensor Model (RTM) 
is a simple, recursive generator based on Kronecker tensor multiplication. We rigorously 
proved that RTM produces several desired characteristics, such as bursty traffic, i.e. bursty 
weight additions. On the other hand, it creates multinomials rather than power laws, for 
example in degree distributions. Later, we proposed Random Typing Graphs (RTG) based 
on a simple ‘random typing’ procedure which is shown to mimic natural behavior very 
well. RTG meets all the desirable properties, being realistic (matching all eleven patterns), 
simple, parsimonious, flexible, and fast. 

• Utility-driven model for human communications: We also designed an intuitive graph 
generator PaC, for modeling human communication behavior. In this model, each node 
(a) uses only local information, and (b) uses no randomness, but instead tries to maximize 
a well-defined utility function. This agent-based model allows us to understand the local 
mechanisms forming communication networks and answer what-if scenarios. Based on the
utility function of PaC, we can explore what the impact of certain changes (e.g. increase in
price-per-minute, flat call rate) would be on the structure and evolution of the network.


14.1.3 Tools for Anomaly and Event Detection in Networks 

In Part III, we developed five novel methods for anomaly detection, each addressing the problem 
in a different setting. In particular, the methods address the problem for (1) plain unlabeled 
graphs, (2) binary- and (3) categorical-attributed graphs, (4) time-varying graphs, and finally (5) 
sensemaking and visualization of anomalies. We briefly highlight our contributions below and 
respectively refer to §9.3, §10.3, §11.3, §12.3, and §13.4. 

• Automatic anomaly detection in plain graphs: We proposed a novel method, called 
OddBall, to detect outlier nodes in graph data, rather than in a collection of data points. 
The previous state-of-the-art had several limitations, such as the assumption of the existence
of node labels and ground truth information. We developed our method so as to avoid
these assumptions and to (1) work for unlabeled and weighted graphs, and (2) operate in a 
completely unsupervised fashion. 

• Parameter-free mining and clustering in attributed graphs: We introduced a novel 
clustering model, called PICS, to find groups of nodes in an attributed graph with (1) similar
connectivity, and (2) attribute homogeneity. The nodes deviating from the discovered
patterns correspond to bridge-nodes with connections across clusters or outlier-nodes that 
do not belong well to any cluster. Two key advantages of our method are (1) it requires no 
parameters such as the number of clusters and similarity functions to be tuned, and (2) its 
running time scales linearly with total graph and attribute size. 

We also proposed a new approach, called CompreX, for identifying anomalies in complex
attributed graphs, that exploits pattern-based compression. The key features of this
approach are (1) it builds the models directly from data and requires no parameters to be
tuned, (2) it generalizes to a broad range of complex data, including graph, image and
relational databases with various attributes, and (3) it proves effective on a broad range of
datasets showing large improvements in both compression rate, as well as anomaly detection
accuracy, outperforming its state-of-the-art competitors.

• Automatic event detection and “attribution” in time-evolving graphs: We developed 
an algorithm to quantify “change” for a graph growing/changing over time that flags those 
time points for which the change is significant as “events”. Two key features of the algorithm
are (1) it enables “anomaly attribution”, that is, it points out those nodes that go
through the highest degree of change, and (2) it works in an online fashion, by efficiently 
updating scores over time. 

• Sensemaking and visualization for anomalies: We introduced a novel solution, called 
Dot2Dot, to visualize and summarize a given set of (anomalous) nodes in a graph. It 
groups ‘close-by’ nodes together and provides ‘concise’ connections among the nodes in 
each group, instead of simply listing them. This provides insight into what coarser groups
the nodes form, and thus facilitates summarization. It also shows good connection pathways
between the given nodes as well as other key ‘connector’ nodes that play a crucial
role in bridging the connections in between. We built an interactive visualization and exploration
tool that enables end users to specify their nodes of interest, visualize the discovered
connection subgraphs, and explore further the neighborhood subgraphs around the given 
set of nodes on demand. 


14.2 Community Impact 


Our work in this thesis has had a broader impact on academia and industry:

Our RTG generator raised a lot of interest with respect to its ties to earlier history of mathematics 
on random typing, and furthermore won the Best Knowledge Discovery Paper award in the 
Conference on Principles and Practice of Knowledge Discovery (PKDD) 2009. The software 
has been made publicly available and been used in the research community. 

Our OddBall won the Best Paper award in the Pacific-Asia Conference on Knowledge Discovery
and Data Mining (PAKDD) 2010. It has also been integrated into Carnegie Mellon’s
Pegasus Tera-Scale Graph Mining system. Its stand-alone source code is made available and 
has been downloaded from all around the world. It has been a main component of the anomaly 
detection system developed for the ADAMS (Anomaly Detection at Multi Scale) collaborative 
research initiative supported by DARPA, for insider threat detection. 

Our CompreX and its proposed extension to the time-evolving setting led to two patents (filed 
with high-impact score) at IBM Research Labs. Compression has been primarily used in communications
theory and databases, for reduced transmission cost and storage cost and increased
throughput and query performance; our work is one of the few to explore compression for various
data mining tasks. Moreover, our event detection system contributed to the Network Science
Collaborative Technology Alliance (NS-CTA) supported by Army Research Labs. 


14.3 Vision and Future Research Directions 


Graphs are powerful and unified tools, effectively capturing long-range relations between the objects
they represent. Not only does the theory of graphs abound with fascinating research problems,
but the practical usage of graphs is also quite prevalent, as many real-world problems can be
cast as graph problems. Therefore, I view graph mining as an exciting field to work in, with a 
promising future at the confluence of theory and practice. 

All real data inherently contain patterns; intuitively even complex systems are often largely governed
by a few underlying simpler mechanisms, which introduce considerable redundancy in the
form of patterns. Therefore, it is intriguing and natural to attempt to find patterns in real data. Being
closely related to pattern discovery, anomaly detection has multi-billion dollar applications
in finance (credit card, accounting fraud detection), health care (insurance claim fraud, disease
outbreak, rare disease detection), security (network intrusion, insider threat detection), biology 
(neuroscience and genetics), and many more. Therefore, understanding how a complex system 
operates and detecting explanatory and emergent patterns (and anomalies therein) will continue 
to be a fascinating area in the future. 

As real-world graphs like social networks and the Web continue growing and the technology to
collect and store data continues advancing, huge amounts of data will become available. As a
result scalability will become an even greater challenge. As older algorithms become obsolete 
due to their memory assumptions, mining patterns efficiently and spotting anomalies as early as 
possible in huge collections of data will require novel data mining approaches. 

While I believe that pattern mining, anomaly detection, and scalability are promising research 
areas, finding solutions to domain-specific questions in other fields is also of interest. As the 
semantics captured by graphs differ, so do the problems they pose. Consequently, I believe that 
graph mining has a lot of potential for cross-domain collaborations in a variety of fields. 

In general terms, it is interesting and crucial to develop scalable methods that provide means to 
discover patterns, spot anomalies/events, and to summarize and interpret large complex graphs. 
Below I elaborate on specific future research directions. 

• Patterns and anomalies in complex graphs: A vast majority of existing work on graphs
is designed for plain undirected graphs. Real-world graphs, however, can appear in much
richer forms (as directed weighted graphs, correlated with other graphs, with node and
edge attributes, changing over time). For example, Facebook data contains user-user interactions,
user-wall post-user relations, user-product recommendations, user interests and
demographics, all correlated and evolving over time. It remains an interesting open problem
to study the inter-relations among multiple correlated graphs that represent more complex
data, and to find patterns and anomalies in such complex graphs.

• Scalability: As real-world data keep growing, their graph representations require petabytes
of storage. This makes scalability a real bottleneck in algorithm design. While in the
past the main goal of most graph mining algorithms was to be effective, the recent trend
has been to build algorithms that are both effective and efficient on possibly disk-resident
graphs that do not fit in main memory. Future research could address this issue (1)
by developing effective heuristics and, ideally, theoretically sound approximate algorithms
that run in linear or sublinear time on large graphs, and (2) by designing algorithms that
can exploit MapReduce-style abstractions for parallelism.

• Graph summarization and understanding: One main component of understanding data
is to visualize it; at the scale of real graphs, however, visualization has become a real
challenge. A good visualization requires a succinct summary of the graph. Future work could
address this problem at two levels: (1) macro-level analysis, which involves summarizing
complex graphs at large; for complex multi-faceted graphs, the goal is to build novel clustering
and compression models that offer users a high-level understanding; and (2) micro-level
analysis, which involves summarizing certain parts of interest in the graph (e.g., subsets of nodes,
small neighborhoods, etc.).

Appendix A

List of publications 


Part I: Patterns of Networks and 
Part II: Generative Models of Networks 

• L. Akoglu, M. McGlohon, and C. Faloutsos. RTM: Laws and a Recursive Generator
for Weighted Time-Evolving Graphs. IEEE International Conference on Data Mining 
(ICDM), 2008 

• M. McGlohon, L. Akoglu, and C. Faloutsos. Weighted Graphs and Disconnected
Components: Patterns and a Generator. ACM Special Interest Group on Knowledge Discovery
and Data Mining (SIG-KDD), 2008

• N. Du, C. Faloutsos, B. Wang, L. Akoglu. Large human communication networks: patterns
and a utility-driven generator. ACM Special Interest Group on Knowledge Discovery and 
Data Mining (SIG-KDD), 2009 

• L. Akoglu, C. Faloutsos. RTG: A Recursive Realistic Graph Generator using Random
Typing. Data Mining and Knowledge Discovery Journal (DAMI), 2009 

• U Kang, M. McGlohon, L. Akoglu, and C. Faloutsos. Patterns on the Connected
Components of Terabyte-Scale Graphs. IEEE International Conference on Data Mining (ICDM),
2010

• P. O. S. Vaz de Melo, L. Akoglu, C. Faloutsos, A. F. Loureiro. Surprising Patterns for
the Call Duration Distribution of Mobile Phone Users. European Conference on
Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML
PKDD), 2010

• L. Akoglu, P. O. S. Vaz de Melo, C. Faloutsos. Quantifying Reciprocity in Large Weighted
Communication Networks. Pacific-Asia Conference on Knowledge Discovery and Data 
Mining (PAKDD), 2012 

Part III: Anomaly Detection 

• L. Akoglu, M. McGlohon, C. Faloutsos. OddBall: Spotting Anomalies in Weighted
Graphs. Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), 
2010 

• L. Akoglu, C. Faloutsos. Event Detection in Time Series of Mobile Communication
Graphs. 27th Army Science Conference, 2010

• L. Akoglu, H. Tong, B. Meeder, C. Faloutsos. PICS: Parameter-free Identification of
Cohesive Subgroups in Large Attributed Graphs. SIAM International Conference on Data
Mining (SDM), 2012

• L. Akoglu, H. Tong, J. Vreeken, C. Faloutsos. CompreX: Fast and Reliable Anomaly
Detection in Categoric Data. ACM International Conference on Information and Knowledge
Management (CIKM), 2012

• L. Akoglu, J. Vreeken, H. Tong, D. H. Chau, C. Faloutsos. Islands and Bridges: Making 
Sense of Marked Nodes in Large Graphs. Carnegie Mellon University Technical Report 
CMU-CS-12-124, 2012 